Objective:
Select a research paper of your choice [published between 2016 and 2022] on any concept related to
information retrieval. Read, understand and summarize it in your own words.
Note:
Theoretical or implementation papers are encouraged, while survey papers are not entertained unless
they have considerable substance.
Instructions:
Summarize the contents in your own words in a condensed form which must include the following:
Problem statement
Solution approach
Architecture (if any)
Results
Conclusion
You are encouraged to identify a problem, drawback or limitation of the paper and suggest
improvements to the approach given in the paper.
Prepare a report of 3 - 6 pages (1500 - 3000 words).
You may refer to the below links to find some sample papers or refer to IEEE, ACM or any other
journal/conference/white papers.
Attach the original paper while submitting it.
I have selected a paper on the Soundex algorithm (attached) and would like a report in the above format.
International Conference on Innovative Computing and Communication (ICICC 2020) TAPAN KANT, SHAILESH KUMAR SHRIVASTAVA, NIRAJ KUMAR TIWARY, NAGMA PARWEEN
SoundexHindi: A Phonetic Matching Algorithm for Hindi Written in English
Tapan Kant (a), Shailesh Kumar Shrivastava (b)*, Niraj Kumar Tiwary (b), Nagma Parween (b)
(a) Patna Women’s College, Bailey Road, Kidwaipuri, Patna, 800001, India
(b) Digital Government Research Centre, National Informatics Centre, Govt. of India, Patna, 800001, India
Abstract: Phonetic matching plays a key role in information retrieval in environments that use multiple Indic languages. Retrieval involves searching a large database from different perspectives so that further analysis can be done; techniques such as indexing, combined with ranking of matches, can be used to achieve this. Different strings may represent the same keyword: because of language complexity, the same word may be spelt differently by different people, and in both rural and urban areas a word may be misspelt, mispronounced, or simply spelt in another way. In such cases exact matching may not return the correct result, and techniques based on phonetic matching can be used for retrieval instead. In this paper we propose an approach that provides a simple and efficient way of matching strings in Hindi written in Roman script. Our approach works as a text-conversion technique for Indian languages, especially Hindi, Marathi, etc.
1. Introduction
Phonetic matching can be defined as matching string patterns according to the way a word is pronounced. Such words have different spellings and writing styles but match phonetically [1]. In other words, they all represent the same keyword despite their different spellings (e.g. Vikash, Vikas, Bikash, Bikas). In India most of the population lives in rural areas.
People from these rural areas speak Hindi with different accents. They write Hindi in Roman script with variations like those mentioned above. Information retrieval from databases requires exact or partial matching of strings against information stored in one or more columns of the database. Phonetic matching identifies the set of strings that sound similar to the closest keyword derived from the string. In practice, when a word is entered by different sources it may be spelt differently because of different writing styles, yet the differently spelt versions can still be matched phonetically. The standard SOUNDEX function works well for English but fails at phonetic matching for Hindi written in Roman script [2]. The common challenges of name matching are: 1) phonetic similarity, 2) missing spaces and hyphens, 3) missing components, 4) split database fields, 5) spelling differences, 6) titles and honorifics, 7) out-of-order components, 8) truncated components, 9) initials. In such cases exact matching may not return the correct result, and techniques based on phonetic matching can be used for retrieval instead. In this paper we propose an approach that provides a simple and efficient way of matching strings in Hindi written in Roman script. Our approach works as a text-conversion technique for Indian languages, especially Hindi, Marathi, etc. The rest of the paper is organized as follows: Section 2 establishes the background of our study, Section 3 discusses the methodology, Section 4 explains the results, and Section 5 concludes.
2. Background
Different phonetic matching methods are: 1) list based, 2) word embedding method, 3) common key method, 4) edit distance method, 5) statistical similarity method, 6) hybrid method.
i.
List method is a simple technique in which a list of all possible variations of a string is maintained and matching is performed by searching that list. Its major drawbacks are space and time complexity and the difficulty of handling variations such as nicknames, initials, and titles without expanding the search space even more [3].
ii. Word embedding uses a numerical vector to represent a word on the basis of its semantic meaning. Two words are said to be semantically similar if they have similar embeddings (e.g. “girl” and “woman” lie close together in vector space). Such words could be missed by spelling-centric methods, yet a human would consider them a match [4].
iii. Common key method is based on English pronunciation: words with similar pronunciation share the same key. Soundex, patented in 1918, is a well-known example. It executes fast and has high recall, but it suits English names (Latin-based languages) and is of little use for Hindi names written in English [5].
iv. Edit distance counts the number of character edits it takes to change one word into another. For example, under the standard Levenshtein definition the edit distance between “Cyndi” and “Cindy” is two (two substitutions), and the edit distance between “Katharine” and “Catherine” is also two. The technique is easy to implement but is limited to Latin-based languages. All edits are given equal weight, even though changing the first character could change the word entirely [6].
v. Statistical similarity method uses millions of matching word pairs to train a model to identify what two “similar words” look like. The model takes two words as input and assigns a similarity score. This method has high precision, but its major drawback is that it requires high-end hardware to train the model [7].
vi.
The hybrid name matching method combines two or more of these word matching algorithms so that the strength of one covers the weakness of another. We use a hybrid method for our Hindi phonetic algorithm, which provides a simple and efficient way of matching strings. Our algorithm (Figure 1) uses three different methods: 1) common key method, 2) edit distance method, 3) statistical similarity method.
3. Methodology
We take the concept of English SOUNDEX and apply it to Hindi. Hindi has vowels such as a, aa, e, ee and consonants such as ka, kha, cha, ta, tha, da, etc. We parse Hindi words written in English according to the Hindi alphabet and map each set of spelling variants to a common representation (i.e. a unique code for each Hindi letter); we call that representation the soundex or phonetic code of the Hindi word. While grouping and mapping Hindi letters into phonetic codes, the following rules have to be taken into consideration.
1) Parse the Hindi names written in English according to the Hindi alphabet, e.g. Ramratan kumar is parsed into ('ra', 'm', 'ra', 'ta', 'n', 'k', 'u', 'ma', 'r').
2) Group short and long vowels into a single code; e.g. e and ee are considered equal.
3) A Hindi letter can be written with different English letters; for example, “ka” in Hindi can be written as k, ka, q, or qa in English.
4) Group the different English letters, or groups of English letters, that can represent a single Hindi letter into one family, e.g. group (k, ka, kaa, q, qa, qaa) into one family.
5) After grouping in this way, 26 groups are formed.
6) Map each parsed letter of the string to its code from the encoding table (Table 1).
7) Skip duplicate consecutive SOUNDEX codes.
8) Calculate the maximum length of the codes and the length up to which the codes match, then calculate the matching percentage between the codes, which gives the matching percentage between the names.
9) Names with a matching percentage of 100 are considered completely matched, and names with a matching percentage between 50 and 90 are considered partially matched.
10) Names with a matching percentage lower than 50 pass through the edit distance method.
FuzzyWuzzy is a Python package that uses Levenshtein distance (the edit distance method) to calculate the difference between sequences. Names with a matching percentage below 50 (from the common key method) pass through the token set ratio of the fuzzywuzzy library, which tokenizes the two strings, compares their shared and differing tokens, and returns a matching ratio between them. This increases the accuracy of the algorithm. Names with a fuzzy ratio (edit distance method) below 80 then pass through the statistical similarity method (discussed earlier), where a model is trained to recognize what two “similar names” look like so that it can take two names and assign a similarity score.
Fig. 1 - SoundexHindi Algorithm.
4. Results and Discussion
To test the algorithm we took 500 Hindi names that have the same pronunciation but spelling variations. These names are completely rejected by the traditional Soundex algorithm but accepted by our soundexHindi algorithm. Of the 500 names, approximately 250 are completely matched (100% matching) by soundexHindi, as shown in Table 2.
5. Conclusion
We have built a rule-based phonetic matching algorithm for Hindi written in English, and it has performed well. Significant work has been done and widely reported in this area for English and other languages, but very little has been done for Hindi written in English. We compared our phonetic matching approach with the traditional SOUNDEX approach. Some names with the same pronunciation that are completely rejected by SOUNDEX show a 100% matching score with our soundexHindi phonetic matching algorithm.
The advantage of our approach is that it gives users and developers a simple, easy and efficient way of phonetic matching for Hindi written in English.
Table 1: Encoding/Mapping Table
Electronic copy available at: https://ssrn.com/abstract=3579322
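The common-key steps of Section 3 (parse into letter units, map each unit to a family code, skip duplicate consecutive codes, compute a matching percentage) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the contents of Table 1 are not available here, so `FAMILIES` is a small hypothetical subset (only the k/q family is quoted in the paper), the greedy longest-match parser is one plausible reading of rule 1, and the percentage is assumed to be the matched prefix length divided by the longer code length. A plain Levenshtein function stands in for the fuzzywuzzy token set ratio, and the statistical similarity stage is omitted.

```python
# Hypothetical family table: NOT the paper's Table 1 (which has 26 groups).
# Only the "K" family is quoted in the paper; the rest are assumptions.
FAMILIES = {
    "A": ["a", "aa"],
    "E": ["e", "ee", "i"],
    "U": ["u", "oo"],
    "K": ["k", "ka", "kaa", "q", "qa", "qaa"],
    "V": ["v", "va", "b", "ba", "w", "wa"],
    "S": ["s", "sa", "sh", "sha"],
    "R": ["r", "ra"],
    "M": ["m", "ma"],
    "T": ["t", "ta", "th", "tha"],
    "N": ["n", "na"],
}
TOKEN_TO_CODE = {tok: code for code, toks in FAMILIES.items() for tok in toks}
MAX_TOKEN_LEN = max(len(tok) for tok in TOKEN_TO_CODE)


def encode(name):
    """Return the list of family codes for a romanized Hindi name."""
    text = "".join(ch for ch in name.lower() if ch.isalpha())
    codes = []
    i = 0
    while i < len(text):
        # Greedy longest match against the known spelling variants.
        for size in range(min(MAX_TOKEN_LEN, len(text) - i), 0, -1):
            code = TOKEN_TO_CODE.get(text[i:i + size])
            if code is not None:
                break
        else:
            # Letter outside the toy table: give it its own code.
            code, size = "?" + text[i], 1
        # Rule 7: skip duplicate consecutive codes.
        if not codes or codes[-1] != code:
            codes.append(code)
        i += size
    return codes


def match_percentage(name_a, name_b):
    """Assumed reading of rule 8: matched prefix length / longer code length."""
    a, b = encode(name_a), encode(name_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return 100.0 * matched / longest


def levenshtein(s, t):
    """Standard edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def classify(name_a, name_b):
    """Rules 9 and 10; scores from 50 up to (not including) 100 are treated
    as partial, which is an assumption for the range the paper leaves open."""
    pct = match_percentage(name_a, name_b)
    if pct == 100.0:
        return "complete"
    if pct >= 50.0:
        return "partial"
    return "needs edit-distance check"
```

With this toy table, `encode("Vikash")` and `encode("Bikas")` both give `['V', 'E', 'K', 'S']`, so the pair is classified as "complete", while "Vikas" and "Vikram" share three of five codes and come out as "partial". A real table would need all 26 families and a decision on how to parse ambiguous sequences such as "kaa" versus "ka" followed by "a".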