Hi - this is the Text Processing and Information Retrieval subject.
Hints

As discussed in today's class, here are hints for the three sub-tasks of the project:

1. Get a search engine going, e.g. do something in NLTK or Lucene, or install an out-of-the-box search engine like Elasticsearch or Solr (Lucene/Java), Lucy (Lucene/C), Sphinx (C++/MySQL), Xapian (C++), Lemur (Java/C++), or ....

2. Get a corpus of documents installed, e.g. ANC or BNC (on V:), Wikipedia (on V:), or the news/Wikipedia samples from Leipzig (a dozen 1M-word samples on V:). You should see V:\CSEMShare when logged in here, and you can mount it on your own PC (go to the third-floor help desk if you can't figure out how). The above corpora plus some dictionaries are loaded under the Natural Language folder.

3. Critique what it does, propose improvements, and do something innovative to implement one of these, e.g.:
   A. preprocess the input and improve recall (e.g. expanding document keywords using WordNet or LDA)
   B. preprocess the query to deal with advanced queries, natural language, topic filters/choices, etc.
   C. postprocess the results to display in a clustered way, e.g. using an expanding hierarchy or a visualization

Guidelines by tutor

You have to implement a search engine (using toolkits, languages, and APIs as you choose) that includes some innovative feature. About 30% of the mark is for the basic search engine (as this is just a matter of choosing an open-source engine or a toolkit with a sample search engine). About 30% is for the innovative concept and proposed design (irrespective of whether you get it working). The rest is for implementing and evaluating it and reporting on it.

The innovative feature could relate to any of many possible directions:
- language technology - how can syntax and/or semantics help
- multilingual/multimodal - query in your language/modality, search the index/text, return results in the original modality
- spidering and/or corpus indexing - innovative ideas at this level (incl. adding syntactic/semantic/MM tags)
- clustering/visualization - group and characterize related hits and allow users to restrict/dig into one of interest

Multilingual/multimodal is where the global interest is today - I am involved in several projects:
- Defence - looking for information about topics given in English, whatever language they are expressed in
- YourAmigo/YourAnswer - shopping by speech on your phone instead of by typing and clicking
- Biomedical - typing/querying without hands or voice - using eye gaze to query/navigate/summarize

I have some corpora I've bought, but the easiest to access and download is Wikipedia. Google makes available Ngrams (≤5) for a billion words of web text and a million books (various languages). Groups of students can share the same infrastructure (basic search engine/corpus/index) and work on different modules (e.g. a speech input module using Google or Amazon APIs, a visualization module, ...).

Suggested Assignment Ideas and Expected Marking Schedule

The table below illustrates a number of generic projects and quantifies the proportion of the total mark you might achieve by subtask. To achieve the full mark for any particular box, your deliverable must be fully functional and of a high quality. We judge high quality by the robustness of your solution to the information retrieval problem you are tackling. While high aesthetic quality is not a requirement, you may benefit from applying basic principles of colour usage (look up via Google: Stephen Few, 'Practical Rules for Using Color in Charts'; 'ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps' by Harrower and Brewer, 2003). Do not sacrifice system functionality or solution robustness for aesthetics - we are fully aware of the 'halo effect', i.e. the observation that aesthetically pleasing systems will be rated more favourably than functionally superior but aesthetically inferior systems.
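To make sub-task 1 concrete, here is a minimal pure-Python sketch of the inverted index at the heart of every engine listed above. It is a toy, and all names in it are my own: a real engine such as Lucene adds tokenisation pipelines, index compression, and a proper ranking function (e.g. BM25) on top of this structure.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    # docs: {doc_id: text}. Returns {term: {doc_id: term frequency}}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    # Score each document by summed term frequency of the query terms
    # and return doc ids, best first.
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=scores.get, reverse=True)
```

A real engine would replace the raw term-frequency score with TF-IDF or BM25 and store the index on disk rather than in a dict, but the data structure is the same.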
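Improvement A (expanding keywords with WordNet) can be sketched as below. The tiny thesaurus is a hypothetical stand-in so the example is self-contained; with NLTK installed you would collect synonyms from WordNet synsets instead.

```python
# Hypothetical mini-thesaurus standing in for WordNet.
THESAURUS = {
    "car": ["automobile", "auto"],
    "fast": ["quick", "rapid"],
}

def expand_query(query, thesaurus=THESAURUS, exclude=()):
    # Return the original terms plus their synonyms, skipping any
    # expansions the user chose to exclude.
    expanded = []
    for term in query.lower().split():
        if term not in expanded:
            expanded.append(term)
        for syn in thesaurus.get(term, []):
            if syn not in exclude and syn not in expanded:
                expanded.append(syn)
    return expanded
```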
We will award the minimum mark to a system that is mostly functional but is clearly unpolished, lacking novelty, lacking robustness (of solution), or does not exceed expectations of functionality.

Specific academic staff or research students will offer specific projects/applications - these will relate to presentations in the first couple of weeks. These include applications with a focus on textual input or databases, as well as projects with a focus on multimedia, including audio (music/speech), images (scene/face/expression/print/handwriting), and videos (scene/actor/affect/interaction), and their combination (e.g. movie with video/audio/captions). Sometimes the focus is interpretation rather than retrieval: what is the affect or emotional content or effect, e.g. is the entity on social media human or computer (bot), innocuous or predatory (criminal), etc.? Sometimes the focus is exploitation of an online resource: financial or weather data, encyclopaedic or dictionary data. Sometimes the focus is on supporting creative activity: writing novels, papers, or theses, or undertaking research or experiments. Sometimes the focus is on improving the interface and the user's experience/success in information retrieval and understanding. Sometimes the goal will be a step back from a proposed implementation, e.g. requirements analysis or human-factors evaluation.

Note that you can boost your marks by completing one of the larger tasks and a few minor tasks. The larger tasks will often require completion of other minor tasks, but we still expect that these be completed to the highest standard. For example, if you choose to tackle the major ranking task, you will need to complete the minor text operations task also. To achieve marks for both tasks, you will need to expand the index's fields markedly. This will require additional text-processing routines to extract the necessary metadata and modifications to the indexing routine to populate those fields.
Different fields may necessitate different indexing parameters, which you can learn about through the relevant Lucene API. The work you put into this task will flow through to your major task (e.g. exploration and comparison of different ranking algorithms).

Predicted marks below are out of a maximum of 100.

Query Operations
- Q1 (10): Implement a query weighting scheme that takes into consideration word capitalisation, word order, word length, or stop words.
- Q2 (20): Implement (Q1). In addition, implement an advanced search form to allow searching on a wide variety and combinations of text and non-text index fields (search on text, title, and author - whatever you put in the index). Provide form fields for range queries, title-field searches, file type (and other specific field) searches, etc.
- Q3 (50-70): Implement (Q2). Implement query expansion using a WordNet-like thesaurus. Provide a facility for the user to exclude expanded terms from the search. Provide a comparison of ranking in graph form between expansion and no expansion (make a graph of recall and precision, ROC).
- Q4 (75-90): Implement (Q3) and implement a non-keyboard method of querying and query execution. Execute a query from page text (maybe with the help of a browser plug-in), or submit a whole page for search (which may need some text processing).

User Interface
- U1 (10): Implement an interface that mimics your favourite 10-blue-link search engine (use your own branding).
- U2 (20): Implement the basic interface but provide filters and sorting controls to control the order/appearance of results, e.g. order by date, filter by file type, filter by author, filter by entity.
- U3 (50-70): Implement a text or visualisation interface facilitating navigation of document clusters. Provide filters and sorting controls to control the order. Alternatively, implement a desktop application that monitors what you are writing in real time and makes suggestions for references in your document library or recent work folder. Display these in lists organised by metadata facets or some other organisation scheme. This will require a range of indexing and text-processing ideas.
- U4 (75-100): Implement a sophisticated visualisation-based interface using a visualisation or web-apps library, visualising metadata and semantic relationships. Provide multiple filtering and sorting capabilities. Provide a data-to-visualisation mapping configuration tool (change display variables), provide an interaction-logging facility, and implement user feedback.

Ranking
- R1 (10): Implement a custom ranking algorithm that ranks documents according to several metadata attributes (like word count). Devise a rating system for documents based on importance of document, age of document, word count, source, file type, etc.
- R2: NA
- R3 (50-70): Implement a number of sophisticated ranking algorithms taking into account keywords, metadata, page structure, named entities, and user models. Make systematic comparisons between competing ranking algorithms and provide graphs of recall and precision (ROC) as your output.
- R4 (70-90): Implement a number of sophisticated ranking algorithms taking into account keywords, metadata, page structure, named entities, and thesaurus expansion. Make systematic comparisons between competing ranking algorithms and provide a graph of recall and precision (ROC) as your output. This will require you to make use of a test corpus. Provide a description on the interface of what ranking is currently in use, and provide a breakdown of why a document is ranked the way it is (go beyond a score - break the score down).

Text Operations
- T1 (10): Add additional fields to the index for a range of documents: date of indexing, word count, file size (bytes), file type, etc.
- T2 (30-50): Implement parsing of PDF documents (with the help of a library). Recover structure from PDF documents and provide per-section searching for PDF documents.
- T3 (50-70): Implement a clustering scheme over the corpus, index the clustering output and cluster descriptors, and present search output as navigable lists of document clusters.
- T4 (85-100): Implement sophisticated parsing of PDF (or other structured) documents, indexing document structure, and provide a searching facility per structure of the document. Match bibliography references to structural areas of the text and provide searching facilities per researcher. Output should provide a portion of text in and around the reference. Clicking on the result link should take you straight to the section on the page. Clicking an alternative link should display all segments of text where the author has referenced the particular research.

Searching & Indexing
- S1: NA
- S2 (20): Implement an auto-complete facility using Ajax.
- S3 (40-60): Implement (S2) and implement a query-suggestion facility (did you mean... X). Implement a dictionary correction facility. Create an index scheme to maximise index freshness by checking several locations for new information to index. Research and implement duplicate-record detection and perform the necessary removal in the index. Provide an HTML report of index size, average field sizes, and other relevant information. Provide a report of your duplicate-document detection results (state which documents are duplicated and where).
- S4 (70-85): Index a series of RSS feeds (news, sport feeds, etc.); provide a facility to set up keyword lists for alerts (via the interface). Push filtered results to a real-time (but simplistic) user interface. Using RSS feeds as your input, implement a distributed index per field and implement parallel searching on the distributed indexes to provide a searching facility over the RSS feeds.

You should be thinking about your assignment now: researching ideas, thinking about the problem you are tackling, why it is a problem, the characteristics of the problem, and your proposed solution.
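As one illustration of a Q1-style query weighting scheme (the boost factors below are arbitrary choices of mine, not a prescribed formula):

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}

def weight_terms(query):
    # Boost capitalised terms (likely names) and long terms (usually more
    # specific); decay weight with position; damp stop words.
    weights = {}
    for pos, raw in enumerate(query.split()):
        term = raw.lower()
        w = 1.0
        if raw[0].isupper():
            w *= 2.0                  # capitalisation
        if len(term) >= 7:
            w *= 1.5                  # word length
        w *= 1.0 / (1.0 + 0.1 * pos)  # word order: earlier terms count more
        if term in STOP_WORDS:
            w *= 0.1                  # stop words carry little content
        weights[term] = w
    return weights
```

The resulting weights would then multiply each term's contribution to the document score.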
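The recall/precision graphs asked for in Q3 and R3 reduce to computing precision and recall at each rank cutoff; a sketch (the relevance judgements are assumed to be given, e.g. from a test corpus):

```python
def precision_recall_points(ranked, relevant):
    # ranked: doc ids best-first; relevant: set of relevant doc ids.
    # Returns (recall, precision) after each rank cutoff, ready to plot.
    points, hits = [], 0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points
```

Plot one curve per configuration (e.g. expansion vs. no expansion, or each ranking algorithm) on the same axes to make the comparison.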
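A minimal sketch of an R1-style rating function, assuming each document record carries a modification time, word count, and source field; the weights are illustrative only:

```python
import time

def metadata_score(doc):
    # doc: {"mtime": seconds since epoch, "word_count": int, "source": str}.
    age_days = (time.time() - doc["mtime"]) / 86400.0
    recency = 1.0 / (1.0 + age_days / 365.0)        # decays over years
    length = min(doc["word_count"] / 1000.0, 1.0)   # saturates at 1000 words
    source = {"journal": 1.0, "blog": 0.5}.get(doc.get("source"), 0.3)
    return 0.5 * recency + 0.3 * length + 0.2 * source
```

In practice this static score would be combined with the query-dependent keyword score rather than replace it.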
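The server-side half of an S2-style auto-complete endpoint can be as simple as a prefix scan over a sorted vocabulary; the function below is an illustrative sketch (the Ajax wiring that calls it on each keystroke is not shown):

```python
import bisect

def autocomplete(vocabulary, prefix, limit=5):
    # vocabulary must be sorted; binary-search to the first term >= prefix,
    # then scan forward while the prefix still matches.
    i = bisect.bisect_left(vocabulary, prefix)
    matches = []
    while (i < len(vocabulary) and vocabulary[i].startswith(prefix)
           and len(matches) < limit):
        matches.append(vocabulary[i])
        i += 1
    return matches
```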
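For the S3 duplicate-detection requirement, a simple starting point is to hash a normalised form of each document's text: documents sharing a fingerprint are exact copies up to case and whitespace. Shingling/MinHash would be the next step for fuzzier duplicates. A sketch, with names of my own choosing:

```python
import hashlib
import re

def fingerprint(text):
    # Normalise case and whitespace before hashing so trivially different
    # copies still collide.
    canonical = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_duplicates(docs):
    # docs: {doc_id: text}. Returns groups of doc ids with identical content.
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(fingerprint(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

The returned groups give you exactly the "which documents are duplicated and where" report S3 asks for.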
Lucene (Apache) + NLTK/scikit-learn (Python)

Refs
- http://lucene.apache.org/
- http://lucenenet.apache.org/
- http://opennlp.apache.org/
- http://tomcat.apache.org/
- http://scikit-learn.org/stable/
- https://www.nltk.org/

Tutorials
- http://www.lucenetutorial.com
- http://www.solrtutorial.com/
- http://www.elasticsearchtutorial.com/