



Specifications for the Data Analysis assignment

Specification

This assignment involves collecting a set of texts written by at least three different authors. You will perform an authorship attribution task, using the method you selected as a result of your Investigation in assignment 1.

You should work as follows:

- Acquire a set of texts by a number of authors where you know the authorship of all texts. You should have at least 100 texts and no fewer than three known authors.
- Divide your set of texts so that half of them are in the 'known' subset and the other half are in the 'unknown' subset:
  - a subset of 'known' writings from which you characterise each of the authors (the training set); and
  - a subset of 'unknown' writings where you use your preferred method to decide which author wrote each of the texts (the test set).
- Implement your preferred method to operate over your set of texts.
- Assign each of the 'unknown' texts to an author.
- Calculate the error rates, using the following classifications:
  - true positive: how many of the unknown texts were correctly assigned to an author;
  - false positive: how many of the unknown texts were wrongly assigned to an author.
- Judge whether your selected method was as good as you hoped, and what problems might have caused false positives. Would you use it again?

You can source text data from an external source, or you are welcome to use the sources provided on the course webpage. Your 'unknown' texts are not really unknown, but you are going to pretend they are and see whether your selected method is able to assign the correct author.

Structure for your Data Analysis submission

1. State where you sourced your texts, and why you selected this source. State how many different authors are represented in the set of texts. Describe the main characteristics of the texts: for example, are they short or long, are they in multiple languages, and do they have any interesting features (e.g. chapters, hashtags, etc.)? (7 marks)

2. Describe how many texts are in the 'known' and the 'unknown' subsets, and how many are from each author in each subset. State how you allocated texts to the subsets: was it random, alternating, or something else? Represent this information in a table with one row per author plus a total row for all authors, and with columns for the number of 'known' texts, the number of 'unknown' texts, and the total number of texts. Use the following table structure (6 marks):

3. Name your chosen method for attributing the 'unknown' texts, and describe your implementation. You may use diagrams. There should be enough detail that the reader would be able to reproduce your implementation. (7 marks)

4. Perform the authorship attribution over the 'unknown' subset of texts. Show the exact numbers of true positives and false positives, both in total and for each of the authors. Summarise this information in a table with one row per author plus a 'total' row representing all authors, and with true positive and false positive columns for both absolute numbers and percentages. The two percentages should add up to 100 in each row. Use the following table structure (7 marks):

5. Write a short discussion about whether your method lived up to your expectations and what might have compromised its performance. State whether you would use this method again or whether you would try a different one. Explain your decision. (8 marks)
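As a rough illustration of the bookkeeping in items 2 and 4, the sketch below shows one possible way to split the texts into 'known' and 'unknown' halves and to tally true and false positives per author. It is not part of the assignment and does not implement any attribution method. The function names `split_known_unknown` and `error_rate_table` are hypothetical, and two assumptions are made that the specification does not state: the split is random and done separately for each author, and each row of the results table groups the 'unknown' texts by their real author.

```python
import random
from collections import defaultdict


def split_known_unknown(labelled_texts, seed=0):
    """Randomly split (author, text) pairs into 'known' and 'unknown' halves.

    Assumption: the split is made per author, so every author appears in
    both subsets. With an odd number of texts the extra one goes to 'unknown'.
    """
    by_author = defaultdict(list)
    for author, text in labelled_texts:
        by_author[author].append(text)

    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    known, unknown = [], []
    for author in sorted(by_author):
        items = list(by_author[author])
        rng.shuffle(items)
        half = len(items) // 2
        known.extend((author, t) for t in items[:half])
        unknown.extend((author, t) for t in items[half:])
    return known, unknown


def _row(label, tp, fp):
    """Build one table row: label, counts, and percentages summing to 100."""
    n = tp + fp
    if n == 0:
        return (label, 0, 0, 0.0, 0.0)
    tp_pct = round(100 * tp / n, 1)
    return (label, tp, fp, tp_pct, round(100 - tp_pct, 1))


def error_rate_table(true_authors, assigned_authors):
    """Tally correct (true positive) and wrong (false positive) assignments.

    Assumption: rows are grouped by the real author of each 'unknown' text.
    Returns rows of (author, TP, FP, TP %, FP %), ending with a 'Total' row.
    """
    counts = defaultdict(lambda: [0, 0])  # author -> [TP count, FP count]
    for truth, assigned in zip(true_authors, assigned_authors):
        counts[truth][0 if truth == assigned else 1] += 1

    rows, total_tp, total_fp = [], 0, 0
    for author in sorted(counts):
        tp, fp = counts[author]
        total_tp += tp
        total_fp += fp
        rows.append(_row(author, tp, fp))
    rows.append(_row("Total", total_tp, total_fp))
    return rows


if __name__ == "__main__":
    # Toy data only: three made-up authors with four placeholder texts each.
    texts = [(a, f"{a} text {i}") for a in ("A", "B", "C") for i in range(4)]
    known, unknown = split_known_unknown(texts, seed=1)
    print(len(known), "known texts;", len(unknown), "unknown texts")

    # Made-up assignments standing in for the output of your chosen method.
    truth = ["A", "A", "B", "B", "C", "C"]
    assigned = ["A", "B", "B", "B", "C", "A"]
    for row in error_rate_table(truth, assigned):
        print(row)
```

The assigned authors would come from whichever attribution method you chose in assignment 1. If you allocate texts by alternating instead of at random, only `split_known_unknown` needs to change.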
Jun 01, 2021