Please review this and let me know whether you can accurately write all of the commands and provide all required screenshots with explanations. The course is DBST 667: Data Mining.
Due date: Saturday, October 10th, 2020.

Section 1

This section involves running some R commands and providing a screenshot of the output of every command. At the end, generate an R script containing all commands in this lab. Every command in the R script must have a descriptive comment explaining what the command does. Submit the R script along with this Word document, which contains all your commands, screenshots, and answers to all included questions. All commands in this exercise must be executed using RStudio only.

Deliverables: Two Files:
(1) Submit this lab report with answers to all questions including output screenshots.
(2) Submit an R script that contains all commands with comments that briefly describe each command's purpose.

Grading: This exercise is worth 80% of the course grade. All questions must be answered in your own words with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

Part 2 – Run an exercise on the Vehicle Silhouettes dataset from vehicle.csv, completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script.

For Reference: UCI Machine Learning Repository: Vehicle Silhouettes

a. Introduction:
i. Based on what you have learned this week about k-means clustering, provide a one-paragraph, master's-level response describing what you anticipate that the kmeans method will accomplish for the Vehicle Silhouettes data. Be specific about the behavior and output structure of k-means models.

b. Data Pre-Processing: Load the Vehicle Silhouettes data into RStudio using the read.csv command (do not use File > Import Dataset > From CSV in the RStudio GUI, as this uses read_csv(), resulting in significantly different variable types!).
i.
Make a copy of the loaded Vehicle Silhouettes data you just imported and name the copy ‘myvehicle’. Keep the original import, as you will need both the original and the copy to complete this report. Include the command demonstrating this step below.
Command: >
ii. Remove the variable class from ‘myvehicle’. Include the command and answer to the question below.
Command: >
Output:
Why do we need to remove the class variable as part of the data preprocessing steps for k-means clustering?
iii. Run the scale() function on ‘myvehicle’. Include the command and answer to the question below. (Note: This command is NOT part of your tutorial. Consult the function help and use the default arguments. Hint: scale() is a function that returns its results. You MUST save the scaled output back to ‘myvehicle’.)
Command: >
Why must we scale data as part of the data preprocessing steps for k-means clustering?
iv. What additional data preprocessing steps (if any) did you need to execute? Include the command(s) and output screenshot below.
Command(s): >
Output:

c. K-Means Clustering – Running the Method (Hint: Record your results with k=4 in the table in part f):
i. Run ‘set.seed(12345)’ and then run the kmeans method with k=4 and store the output in a variable named ‘kc’. Include the command, output screenshot, and discuss the input parameters you used.
Command: >
Output:
Discussion:
ii. Enter ‘kc’ at the prompt. Provide the output below and then answer the following questions:
Output:
How many instances are in each cluster?
What information does the cluster means section provide and how were those numbers obtained?
What is the clustering vector?
What is the sum of squares by cluster and what does it mean?
iii. Run the ‘kc$iter’ command. Include the command, output screenshot, and explain what the output shows.
Command: >
Output:
Discussion:

d. K-Means Clustering – Evaluate the Model:
i.
Build the cross-tabulation to compare how the method clustered the vehicles in ‘myvehicle’ with the actual vehicle class from your original import. Include the command, output screenshot, and answer the following questions:
Command: >
Output:
What is the dominant vehicle class in each cluster?
What is the dominant cluster for each vehicle class?
What percentage of vehicles were clustered in agreement with the actual class?

e. K-Means Clustering – Cluster Visualization:
i. Run the ‘clusplot(kc)’ function to visualize your model. Modify the plot appearance to make your visualization clear and easy to interpret. Unlike previous exercises, your visualization will now be evaluated on clarity and aesthetics in addition to the standard command, output, and interpretation evaluation. Include the full command, output screenshot (zoomed in), and a one-paragraph, master's-level response with your interpretation of your plot. (Hint: Your interpretation should discuss all of the visualized clusters and should begin to address specific observations (data points) within each that warrant discussion.)
Command: >
Output:
Interpretation:

f. K-Means Clustering – Experiment with Different K Values (3 Runs Summarized):
i. Completely fill in the table below documenting the results of your experimentation with modifying the k value. You may use any k value other than 4 that is greater than 0. You do not need to provide any commands or output screenshots in this report. However, you will be evaluated on these commands being present in your R script!

k= | Number of Instances in Each Cluster | Between Clusters Sum of Squares | Within Clusters Sum of Squares | Number of Iterations

ii. What effect do you observe that modifying the k values has on the method results? Provide a one-paragraph, master's-level response below:
iii. What is an ideal value of k for the Vehicle Silhouettes data? This is a subjective and open-ended question.
Challenge yourself and come up with a creative and well-supported answer for which value you believe is ideal. Provide a one-paragraph, master's-level response below:

g. Summary:
i. What differences between k-means clustering and classification methods did you observe? Provide a one-paragraph, master's-level response.

References

Section 2

Deliverables: One File:
(1) Submit this lab report with answers to all questions including output screenshots.

Grading: This exercise is worth 80% of the course grade. All questions must be answered in your own words with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

The purpose of this exercise is to make practical sense of text mining. This assignment DOES NOT require using RStudio. The assignment consists of the three parts below. Put all of your answers in the spaces provided. Answer all questions in your own words. This assignment is designed to be free from the need for external research. Should the need arise to include external sources, ensure that you properly cite and attribute all non-original content.

Part 1 – Tag Clouds
a. Visit TagCrowd and:
i. Create a tag cloud for a document or web page of your choice. Explain what text or page you chose, why, and then provide a one-paragraph, master's-level discussion about what the tag cloud reveals about your selection. Include your tag cloud as a screenshot below followed by your interpretations explaining the items noted above.
Output Screenshot: https://twitter.com/barackobama
Interpretation:
ii. Adjust the ‘Options’ available on the TagCrowd page to fine-tune your tag cloud. The objective is to provide a more concise and clear tag cloud that more accurately depicts the source content. Post your revised tag cloud screenshot below followed by a discussion on what options you selected and why.
(Hint: You should go through multiple iterations, changing things like the maximum number of words, minimum frequency, group similar, and don’t show these words, to arrive at your revised cloud. This is especially true if you select a web page, as there are banners, headers, footers, and other content that does not relate to the primary content of the page.)
Output Screenshot:
Interpretation:
iii. What visual advantages does a tag cloud provide? What about disadvantages? Provide at least one example of each in your one-paragraph, master's-level response.

Part 2 – Yippy Search
a. Visit Yippy Search and:
i. Run a search on a topic of your choice. Include a screenshot of the results page and explain in detail how the results are presented. (Hint: Do not simply look at the main results page but also the sidebar results.)
Output Screenshot:
Discussion:
ii. How is text mining used by Yippy Search? Provide a one-paragraph, master's-level response. (Hint: You do not need external research for this response. Interpret the output and then provide your rationale, based on the readings this week, to draw your conclusions.)
Discussion:
iii. Describe at least one scenario where you could imagine that Yippy Search could be helpful. Provide a discussion below justifying why you believe your scenario would benefit from Yippy Search. Provide a one-paragraph, master's-level response.
Discussion:

Part 3 – Open Calais
a. Visit Open Calais and:
i. Copy and paste or provide a PDF of any large body of text with references to people, places, organizations, and things into the appropriate options on the Open Calais page. Include an output screenshot of the results page* and then provide a brief discussion about why you selected this particular text. (*Hint: You need only take a screenshot of the immediate output window, i.e., you do not need to provide a screenshot showing all scrolled content in the case of massive text.)
Output Screenshot:
Discussion:
ii.
From a textual data mining perspective, what does the Open Calais tool accomplish? Highlight at least three things that the tool reveals about the text you analyzed. Provide a one-paragraph, master's-level response.
Discussion:
iii. Describe at least one scenario where you, personally, within either an academic or professional setting, would use this tool. Provide a one-paragraph, master's-level response that details the specific application of the Open Calais tool and why it may be advantageous to use.
Discussion:

References
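
For the Section 1 workflow (copy the import, drop the class label, scale, cluster, cross-tabulate, then vary k), a minimal sketch on R's built-in iris data may help show the shape of each command and its output. The iris stand-in, the variable names original, mydata, xt, and runs, and the choice of k = 3 are assumptions made purely for illustration. This is not the vehicle.csv data, and none of the numbers it prints carry over to the lab report.

```r
# Illustrative only: iris is a stand-in dataset, not vehicle.csv.
original <- iris                          # keep the labelled import untouched
mydata   <- original                      # working copy (plays the role of 'myvehicle')
mydata$Species <- NULL                    # drop the class label; k-means is unsupervised
mydata <- as.data.frame(scale(mydata))    # scale() returns a matrix, so save the result back

set.seed(12345)                           # fix the random starting centers
kc <- kmeans(mydata, centers = 3)         # k = 3 only because iris has three species

kc$size                                   # number of instances in each cluster
kc$centers                                # cluster means, in scaled units
kc$withinss                               # within-cluster sum of squares, one per cluster
kc$iter                                   # iterations run before the algorithm stopped

# Cross-tabulate the actual class (rows) against the assigned cluster (columns)
xt <- table(original$Species, kc$cluster)
print(xt)

# Summarise a few alternative k values in one table
ks <- c(2, 3, 5)
runs <- do.call(rbind, lapply(ks, function(k) {
  set.seed(12345)                         # same seed for every run, for comparability
  fit <- kmeans(mydata, centers = k)
  data.frame(k            = k,
             sizes        = paste(fit$size, collapse = ", "),
             betweenss    = fit$betweenss,
             tot.withinss = fit$tot.withinss,
             iter         = fit$iter)
}))
print(runs)

# Optional plot; needs the 'cluster' package and an interactive session.
# clusplot() takes the data and the cluster assignment vector.
if (interactive() && requireNamespace("cluster", quietly = TRUE)) {
  cluster::clusplot(mydata, kc$cluster, color = TRUE, shade = TRUE,
                    labels = 2, lines = 0,
                    main = "k-means clusters (iris stand-in)")
}
```

The same seed is reset before every kmeans() call so that runs with different k differ only in k, not in the random starting centers. The plot is guarded so the script still runs where the cluster package is unavailable.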