Data Mining IS421 – Assignment 1

Solve all of the following questions:

1) Question One

I. Given the x, y coordinates of the following data points:
   (0,1); (1,3); (4,11); (5,12); (6,13); (9,21); (10,23)
   1) Compute the mean, median, and mode of the y values. What does each measure tell us?
   2) Compute the standard deviation of the y values. What does the standard deviation tell us?
   3) Compute the IQR (interquartile range) of the y values and state what it means.
   4) Compute the coefficient of variation for the x and y values and explain what it means to the data miner.
   5) Find the mean deviation of the x values and state what it tells the data miner.
   6) For the y values, which tendency measure best represents the data?

II. Derive (or state, with a logical reason) a relationship between the following tendency measures:
   a) Mean, mode, and median, stating which measure is most useful as a tendency measure when we know that the data points contain outliers.
   b) Arithmetic, harmonic, and geometric mean, giving a real-life, practical example where each measure is the most reasonable one.
   c) Explain how to modify the mean to solve the problem of outliers.
   d) Differentiate between tendency and dispersion measures from the following perspectives:
      i. The meaning of each type.
      ii. Why we need to compute tendency and dispersion measures.
      iii. Which one is more reasonable when we know that our data set has noise and/or outliers?

2) Question Two

I. What is the difference between disjoint and independent events? Are disjoint events independent? Justify your answer.

II. A box contains 8 red, 3 white, and 9 blue balls. If three balls are drawn at random without replacement, determine the probability that:
   a. All three are red
   b. Two are red and one is white
   c. All three are white
   d. At least one is white
   e. One of each color is drawn
   f. Repeat the experiment with replacement and answer parts (a) to (e) again.

III. Three items are taken at random from a box of 12 items and inspected. The box is rejected if more than 1 item is found to be faulty. If there are 3 faulty items in the box, find the probability that the box is accepted.

IV. Four machines A, B, C, and D produce 10, 20, 25, and 40% of the output of a certain product. The probability of a defective product from each machine is 0.03, 0.05, 0.02, and 0.01, respectively. If a randomly chosen item is defective, what is the probability that it was produced by D?

V. Four cards are randomly drawn, with replacement, from a deck of 52. Find the probability that the cards chosen are, in order, a king, a queen, a seven, and a heart.

VI. A box contains 2000 components, of which 5% are defective; a second box contains 500 components, of which 40% are defective. Two other boxes each contain 1000 components, of which 10% are defective. We select one of the boxes at random and randomly select an individual component from that box. What is the probability that the component is defective?

VII. Assume a 50% chance that a newborn child is a boy. In a family of 3 children, find the probability that:
   a. There are exactly two boys
   b. All are girls
   c. Not all are girls
   d. There is at least one girl

3) Question Three

Using boxplots, detect the outliers in the following data set:
   12, 13, 17, 8, 7, 15, 40, 30, 35, 70, 0
Then answer the following questions:
   - Do the detected outliers seem to be noisy data?
   - In the previous example, can you detect the noisy data? If yes, list them.

Notes:
   - The sheet should be completed individually.
   - No late submissions will be accepted.
   - Cheating will be punished according to the IS cheating policy.
   - Please clarify any ambiguity with any TA of the subject.
   - Due date: week starting 29-10-2011, during your lab time.
   - The assignment's grade will be scaled to 6 marks.
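
For Question Three, a boxplot conventionally flags as outliers the values lying more than 1.5 x IQR below the first quartile or above the third quartile. The sheet does not say which quartile convention to use, so the sketch below is only one possible way to check a hand calculation. The function name boxplot_outliers, the parameter k, and the choice of Python's statistics.quantiles are assumptions made for illustration, not part of the assignment.

```python
from statistics import quantiles


def boxplot_outliers(values, method="inclusive", k=1.5):
    """Return the values outside the boxplot fences (hypothetical helper).

    method is passed to statistics.quantiles: "inclusive" treats the data
    as the whole population, "exclusive" treats it as a sample.
    k = 1.5 is the usual boxplot multiplier (an assumption here).
    """
    q1, _median, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return sorted(v for v in values if v < lower_fence or v > upper_fence)


# Data set from Question Three.
data = [12, 13, 17, 8, 7, 15, 40, 30, 35, 70, 0]

for method in ("inclusive", "exclusive"):
    print(method, boxplot_outliers(data, method=method))
```

With only eleven values, the quartile convention can change which points fall outside the fences, so it is worth stating the convention used alongside the answer.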