Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries

I need help writing a literature review on the topic "A LIBRARY USER’S TUTORIAL SYSTEM". You will find the three articles you need for the literature review in the attachments, along with a copy of my project report.


Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries

Anderson A. Ferreira, Adriano Veloso, Marcos André Gonçalves, Alberto H. F. Laender
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais
31270-901 Belo Horizonte, Brazil
{ferreira, adrianov, mgoncalv, laender}@dcc.ufmg.br

ABSTRACT
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available is the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Retrieval; I.5.2 [Pattern Recognition]: Classifier design and evaluation

General Terms
Algorithms, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. JCDL’10, June 21–25, 2010, Gold Coast, Queensland, Australia. Copyright 2010 ACM 978-1-4503-0085-8/10/06 ...$10.00.

Keywords
Name Disambiguation, Bibliographic Citations

1. INTRODUCTION
Several scholarly digital libraries (DLs), such as DBLP¹, CiteSeer², MEDLINE³ and BDBComp⁴, provide features and services that facilitate literature research and discovery as well as other types of functionality. Such systems may list millions of bibliographic citation records (here understood as a set of bibliographic features such as author and coauthor names, work title and publication venue title, of a particular publication) and have become an important source of information for academic communities since they allow the search and discovery of relevant publications in a centralized way. Also, studies about the DL content can lead to interesting results such as coverage of topics, tendencies, quality and impact of publications of a specific sub-community or individuals, patterns of collaboration in social networks, etc. These types of analysis and information, which are used, for example, by funding agencies for decisions on grants and for individual’s promotions, presuppose high quality content [20, 22].

Citation management within DLs involves a number of tasks.
One in particular, author name disambiguation, has required a lot of attention from the DL research community due to its inherent difficulty. Specifically, name ambiguity is a problem which occurs when a set of citation records contains ambiguous author names (the same author may appear under distinct names, or distinct authors may have similar names). This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of content (e.g., by means of automatic harvesting).

The name disambiguation task may be formulated as follows. Let C = {c1, c2, ..., ck} be a set of citation records. For each author in a citation record ci, an authorship record ri is created to represent his/her participation in that citation. The objective is to produce a disambiguation function which is used to partition the set of authorship records into n sets {a1, a2, ..., an}, so that each partition ai contains (ideally all and only all) the authorship records in which the ith author appears.

¹ http://dblp.uni-trier.de
² http://citeseer.ist.psu.edu
³ http://medline.cos.com/
⁴ http://www.lbd.dcc.ufmg.br/bdbcomp

To disambiguate the bibliographic citations of a digital library, first we may split the set of authorship records into groups of ambiguous authors, called ambiguous groups (i.e., groups of citations having authors with similar names). The ambiguous groups may be obtained, for instance, by using a blocking method [27]. Blocking methods address scalability issues, avoiding the need for comparisons among all authorship records.

The challenges of dealing with name ambiguity in bibliographic DLs have led to a myriad of disambiguation methods [4, 5, 8, 9, 14, 15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 33, 34, 35, 41].
However, despite the fact that most of these methods were demonstrated to be relatively effective (in terms of error rate or similar metrics), none of them provides a perfect and final solution for the problem (i.e., they produce errors). Existing disambiguation methods usually follow either an unsupervised or a supervised approach. In the former case, the methods exploit similarities between authorship records in order to place in the same group those records that belong to the same author. In the latter case, the methods exploit a set of training examples, from which a disambiguation function is derived and then used to place authorship records in the corresponding group.

Supervised methods are usually the most effective ones for name disambiguation. In more detail, we are given as input a set of authorship records called the training data (denoted as D) that consists of examples or, more specifically, records for which the correct authorship is known. Each example is composed of a set F of m features {f1, f2, ..., fm} along with a special variable called the author. This author variable draws its value from a discrete set of labels {a1, a2, ..., an}, where each label uniquely identifies an author. The training examples are used to produce a disambiguation function (i.e., the disambiguator) that relates the features in the training examples to the correct author. The test set (denoted as T) for the disambiguation task consists of a set of authorship records for which the features are known while the correct author is unknown. The disambiguator, which is a function from {f1, f2, ..., fm} to {a1, a2, ..., an}, is used to predict the correct author for the records in the test set. In this context, the disambiguator essentially divides the records in T into n sets {a1, a2, ..., an}, where ai contains (ideally all and only all) the authorship records in which the ith author is included.
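
The supervised formulation above can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: the feature encoding, the overlap-based scoring rule and the names `features`, `train`, `disambiguate` and `partition` are hypothetical and are not taken from the paper, whose supervised step uses a different (associative) disambiguator.

```python
from collections import defaultdict

def features(record):
    """Encode an authorship record as a set of features f1..fm (hypothetical encoding)."""
    feats = {f"coauthor={c.lower()}" for c in record["coauthors"]}
    feats |= {f"title={w.lower()}" for w in record["title"].split()}
    feats.add(f"venue={record['venue'].lower()}")
    return feats

def train(training_data):
    """Build one feature profile per author label from the training data D."""
    profile = defaultdict(set)
    for record, author in training_data:
        profile[author] |= features(record)
    return dict(profile)

def disambiguate(profile, record):
    """Map a record's features to the label whose profile shares the most features."""
    feats = features(record)
    # sorted() makes ties deterministic: the alphabetically first label wins.
    return max(sorted(profile), key=lambda a: len(profile[a] & feats))

def partition(profile, test_set):
    """Divide the test set T into one group of record indices per predicted author."""
    groups = defaultdict(list)
    for i, record in enumerate(test_set):
        groups[disambiguate(profile, record)].append(i)
    return dict(groups)

# Toy training data D (correct author known) and test set T (author unknown).
D = [
    ({"coauthors": ["A. Veloso"], "title": "Name disambiguation in digital libraries",
      "venue": "JCDL"}, "a1"),
    ({"coauthors": ["B. Lee"], "title": "Protein folding simulation",
      "venue": "Bioinformatics"}, "a2"),
]
T = [
    {"coauthors": ["A. Veloso"], "title": "Citation disambiguation", "venue": "JCDL"},
    {"coauthors": ["B. Lee"], "title": "Folding kinetics", "venue": "Bioinformatics"},
    {"coauthors": ["M. Goncalves"], "title": "Digital libraries search", "venue": "JCDL"},
]

groups = partition(train(D), T)
print(groups)  # {'a1': [0, 2], 'a2': [1]}
```

Note that this toy disambiguator always returns one of the labels seen in D, so it cannot flag a record that belongs to an author absent from the training data.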
Alternatively, the disambiguator may take as input a pair of authorship records and output a binary decision on whether these records belong to the same author.

Although successful cases of the application of supervised methods have been reported [9, 14, 17, 36, 35, 37], the acquisition of training examples requires skilled human annotators to manually label authorship records. DLs are very dynamic systems, thus manual labeling of large volumes of examples is infeasible. Further, the disambiguation task presents nuances that impose the need for methods with specific abilities. For instance, since it is not reasonable to assume that all possible authors are included in the training data, disambiguation methods must be able to detect unseen authors, for whom no label was previously assigned (i.e., there are no authorship records for these authors in the training data).

Unsupervised methods, on the other hand, require no manual labeling effort, since they simply group authorship records into clusters by maximizing intra-cluster similarity while minimizing inter-cluster similarity. Obviously, the choice of a proper similarity measure is of paramount importance, and a natural choice is to employ similarity measures based on highly discriminative features, such as coauthor names. In this case, the resulting clusters are very likely to be pure, in the sense that each cluster is likely to contain only authorship records of the same author. The drawback, however, is that some authors are likely to have their authorship records fragmented into several (pure) clusters, compromising the effectiveness of unsupervised methods.

In this paper, we propose a hybrid disambiguation method, which will hereafter be referred to as SAND (standing for Self-training Associative Name Disambiguator). SAND exploits the strengths of both unsupervised and supervised methods. Specifically, it works in two steps.
In the unsupervised step, recurring patterns in the coauthorship graph are exploited in order to produce pure clusters of authorship records. Then, in the supervised step, a subset of the extracted clusters is provided as training data, from which a disambiguation function is derived. The final result is a highly effective and extremely practical disambiguator, as will be shown in a set of experiments using citation records extracted from the DBLP and BDBComp collections. The results show that SAND outperforms unsupervised methods by more than 27% on the DBLP collection and 4% on the BDBComp collection. Improvements when compared against supervised methods are also reported.

The rest of this paper is organized as follows. In Section 2 we discuss related work. In Section 3 we describe the proposed hybrid method, SAND. In Section 4 we present the evaluation of SAND and compare its effectiveness with the effectiveness provided by other representative methods. Finally, in Section 5 we conclude the paper.

2. RELATED WORK
The name disambiguation methods proposed in the literature adopt a wide spectrum of solutions [32] that include approaches based on manual assignment by librarians [31], collaborative efforts⁵, unsupervised techniques [4, 5, 8, 15, 16, 18, 19, 23, 25, 26, 28, 33, 34, 41] and supervised techniques [9, 14, 17, 36, 35, 37]. The unsupervised methods, i.e.,
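
As a rough illustration of the unsupervised step outlined in the introduction of the excerpt above, where coauthor names are used to form pure clusters, the following Python sketch groups authorship records that share at least one coauthor name. It is a simplification under stated assumptions (plain connected components over shared coauthor names, computed with a small union-find) and not SAND's actual clustering procedure; the function `cluster_by_coauthor` and the sample records are hypothetical.

```python
def cluster_by_coauthor(records):
    """Group record indices so that records sharing a coauthor name end up together.

    `records` is a list in which each entry is the list of coauthor names of one
    authorship record within a single ambiguous group (hypothetical input format).
    """
    parent = list(range(len(records)))

    def find(i):
        # Find the cluster representative of record i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # normalized coauthor name -> first record index that used it
    for i, coauthors in enumerate(records):
        for name in coauthors:
            key = name.strip().lower()
            if key in first_seen:
                # Shared coauthor: merge this record's cluster with the earlier one.
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Coauthor lists of five authorship records for one ambiguous author name.
records = [
    ["A. Veloso", "M. Goncalves"],  # record 0
    ["M. Goncalves"],               # record 1
    ["B. Lee"],                     # record 2
    ["A. Laender", "a. veloso"],    # record 3
    [],                             # record 4: no coauthors
]

clusters = cluster_by_coauthor(records)
print(clusters)  # [[0, 1, 3], [2], [4]]
```

Records 2 and 4 stay in singleton clusters even if they belong to the same author as another cluster, which is the kind of fragmentation that the supervised step is meant to address.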
Answered 2 days after Jul 09, 2022


Priyang Shaileshbhai answered on Jul 11 2022
LITERATURE REVIEWS
WRITEUP LIBRARY USER’S TUTORIAL SYSTEM
In their study, Vlasenko and Ivanova found that the Internet and its information channels are more than just a repository for knowledge; they are also popular modes of communication that can assist interlocutors in gaining beneficial experience. Students increasingly use digital technology to virtualize their education and work, which provides various advantages such as easy access, faster information transmission, and more opportunities for learning.
This kind of practical training deprives education of the procedures and work principles typical of the professional environment, including those that building facilities might require.
Although the benefits of multimedia for language teaching are undeniable, there are many disadvantages as well. Many of these are related to issues rooted in language learning or the educational process itself. For example, it has been shown that students often misunderstand and misinterpret various social and production processes when performing them virtually rather than in person.
However, in today's world, where virtual conversation has become the norm, many people do not engage in discussion and end up misunderstanding each other. This is why the increasing use of virtual means of communication is ineffective — in-person communication produces more effective results in terms of professional competencies and their formation. Virtualization of the educational...