implementing an abstract compressed trie to do the search
Assignment Two COMP3506/7505 Semester 2, 2018 Assignment 2 30 Marks Due: Monday, 15 October 2018 at 6:00pm Introduction Searching is a common process in many applications, with textual searches being a very common type of search. Search engines, query engines and many data mining applications are examples of textual search applications. These applications make use of sophisticated algorithms and data structures to provide efficient search behaviour. In this assignment you will implement a general textual search application. The application is to be optimised to perform many complex searches on a single document or collection of documents. Details You are provided with three files representing the document to be searched, an index of sections in that document and a set of stop words. The document to be searched (shakespeare.txt) is a large text file containing the complete works of William Shakespeare. The index of sections (shakespeare-index.txt) contains the titles of the works included in the document and the line number on which they start. The set of stop words (stop-words.txt) contains words that are to be ignored in a logical search. All three files are encoded in UTF-8 format. When searches are performed on the document, the results need to indicate the position at which the search term(s) appear in the document. The position is indicated by the line number on which the term appears and the column number of the first character of the term. Searches will only be for English words and will ignore the case of the words in either the document or search term. (e.g. If the document contained the text “Well, come again tomorrow.”, a search for the phrase “well come again tomorrow” would match it.) When determining the text that makes up a single word in the document, most punctuation marks can be ignored. The exception is for an apostrophe in the middle of a word, such as for a possessive noun or a contraction. (e.g. “I’ll” would be stored as the word “i’ll”, and “honour’s” as “honour’s”.) Apostrophes, or other punctuation, at the start or end of a word can be ignored. (e.g. “'tis” would be stored as the word “tis”, and “years'” as “years”.) The file shakespeare-index.txt contains the titles and line numbers of each of the works in the shakespeare.txt file. The titles and line numbers are separated by a comma. Your application needs to use this index file to determine if a search term appears in a particular work of Shakespeare. For example, searching for “to be or not to be” in “the tragedy of hamlet”, should return that it is on line 25,779 and column 1. Whereas, searching for “to be or not to be” in “the tragedy of romeo and juliet”, should return that no match was found. The titles should not have punctuation removed and the full title, exactly as it is in the index file, is needed when identifying a section in a search. The stop-words.txt file contains a list of words that are used too commonly in English text to be useful as parts of search terms. This file contains one stop word per line in the file. Your application needs to store all of the stop words found in the document, as there may be situations when a search needs to include them. But, most searches you implement should ignore any of the stop words that appear in the search term. The only search that will not ignore the stop words is the search for a phraseOccurrence. For example, the famous phrase “to be or not to be” in Hamlet would not be able to be found if you ignored the stop words, as almost the entire phrase consists of stop words. COMP3506/7505 Semester 2, 2018 Your application needs to store the data in appropriate data structures allowing efficient searching. Part of your mark will be based on the data structures you use and your justification of your choices. Your application should be designed with the assumption that the document will be searched many times and that searching needs to return a result as fast as possible. You should process the document’s text to optimise search performance. Searching is expected to occur in main memory. Your application should be designed so that it will work with any text document that has an associated index in the format described above. The application should also work with any list of stop words. Testing of your application may use a different text document and index file, or a different set of stop words. Your application needs to implement the following searches: 1. Count the number of occurrences of a word. 2. Search for a single word. 3. Search for a prefix of a word. (e.g. Searching for “obscure” would find “obscure”, “obscured”, “obscures” and “obscurely”. 4. Search for a phrase, where a phrase is a sequence of words that are contiguous (e.g. “to be or not to be”). 5. Implement boolean search logic for the operators AND, OR and NOT. There are two parts to the fifth search. One part is to find all lines which match the search criteria. The second part is to find all occurrences of the terms in the search criteria within one or more sections of the document. So that you do not need to implement parsing of search strings, you are provided with an interface called Search and a stub class AutoTester that implements it. These will be used to automate the testing of your application’s functionality. Your AutoTester class will need to implement the constructor stub and override the interface’s methods. These methods should call the related functionality in your assignment and return the results defined in the Search interface’s JavaDoc. If you do not implement some searches, the methods related to these should not be implemented in your AutoTester class. You are provided with two simple entity classes called Pair and Triple that are used by Search to return results from various methods. You may implement a user interface for the assignment if you would like to do so. This will not be executed by markers and will not be considered in assessing your mark for the assignment. Hints The intention of the assignment is to determine not just your ability to implement sophisticated data structures and their related algorithms. It is to determine your ability to select data structures and algorithms that are suitable for complex tasks. This may involve using a combination of data structures and algorithms to provide a good solution. Section 13.3.4 in the textbook describes an approach to implementing a search engine. You may find it helpful in providing ideas for this assignment. Constraints You may not move any provided classes or interfaces to different packages. You may not add, remove or change anything in the Search interface. You may not add, remove or change anything in Pair or Triple. You may not move the text files out of the files directory and you may not change the names of these files. COMP3506/7505 Semester 2, 2018 You may not use anything from the Java Collections Framework in the JDK, or any other Java data structures library, to implement the Abstract Data Types (ADTs) or Concrete Data Types (CDTs) in your assignment. You may implement the Comparable, Comparator, Iterator or Iterable interfaces for use in your ADTs or CDTs. Your implementation of the data structures in the assignment must be based on basic Java language constructs and not libraries. See: https://docs.oracle.com/javase/8/docs/technotes/guides/collections/overview.html The restriction of not using anything from the Java Collections Framework (JCF) does not apply to your implementation of AutoTester. Some of its methods return results that are types from the JCF. Your implementation of these methods will need to put the results found from your data structure searches into an appropriate JCF concrete type. Analysis In the JavaDoc comments for your CDT classes indicate the run-time efficiency of each public method. Use big-O notation to indicate efficiency (e.g. O(n log n)). This should be on a separate line of the method’s JavaDoc comment, as the last line of the description and before any @param or @return tags. When determining run-time efficiency of a method, take into account any methods that it may call. In the JavaDoc comments for your CDT classes indicate the memory usage efficiency of the data structure. This should be in the JavaDoc comment for the class and should be on a separate line, which is the last line of the description and before any @author, @see or any other tags. You must provide an explanation as to how you determined the run-time and memory usage efficiencies. In a file called readme.pdf, provide a justification for the algorithms and data structures that you used in implementing the application. The justification should explain why the algorithms and data structures that you implemented were appropriate in the context of efficiently performing many different searches on the same document in memory. COMP7505 Extension Students who are studying COMP7505 must complete a further research component to this assignment. This extension is worth an additional 10 marks. Your final mark displayed on BlackBoard’s Grade Centre will be scaled to be out of 30. If a student studying COMP3506 completes this extension it will not be marked. Pre-processing the text in the document to optimise searching is an expensive process. You are to research options of how to minimise this cost. For example, can the results of the pre-processing be saved to a file that can then be loaded directly into the data structures used for searching? You are to research possible approaches and write a report describing one of these approaches. The description should briefly summarise why the approach was an appropriate selection in comparison to other approaches. Your report should analyse the run-time efficiency of the selected approach and compare this to the run-time efficiency of pre-processing the document. You should draw a conclusion as to whether the selected approach is a better alternative than pre-processing the document or not. You may attempt to implement the approach in your application. If you do so, the AutoTester constructor should check the documentFileName to determine if it has been previously pre- processed. If it has been pre-processed then the