Li Zhang, a doctoral graduate of our School, has published an article entitled “LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation” in the Journal of the Association for Information Science and Technology (JASIST). JASIST is a top-tier journal in information science and is the journal of ASIS&T.
Li Zhang is the first author of the article. His supervisor is Wei Lu, Professor at our School and Director of the Information Retrieval and Knowledge Mining Laboratory, Wuhan University.
The article explores the problem of author name ambiguity in the academic world and puts forward a method for automatically building large labeled datasets from two open academic information resources, ORCID and DOI. Using this method, the authors built LAGOS-AND, which consists of two large, gold-standard sub-datasets for author name disambiguation (AND): LAGOS-AND-BLOCK, created for clustering-based AND research, and LAGOS-AND-PAIRWISE, created for classification-based AND research. Compared with existing datasets, the LAGOS-AND datasets present several advantages. First, they are large: the initial versions (v1.0, released in February 2021) include 7.5 M citations authored by 798 K unique authors (LAGOS-AND-BLOCK) and close to 1 M instances (LAGOS-AND-PAIRWISE). Second, both datasets closely resemble the whole Microsoft Academic Graph (MAG) when validated across six facets: author position distribution, publication date distribution, gender distribution, ethnicity distribution, name popularity distribution, and domain distribution.
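To illustrate the general idea of deriving labels from ORCID and DOI links, the following is a minimal Python sketch, not the authors' actual pipeline. It assumes each record is a (DOI, printed author name, ORCID iD) triple, groups records into blocks by a hypothetical key (last name plus first initial), and labels a pair of records in the same block as positive when both carry the same ORCID iD. The record values, the blocking key, and the function names are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (DOI, author name as printed, ORCID iD). All values are invented.
records = [
    ("10.1000/a1", "Wei Wang", "0000-0001-0000-0001"),
    ("10.1000/a2", "W. Wang", "0000-0001-0000-0001"),
    ("10.1000/a3", "Wei Wang", "0000-0002-0000-0002"),
    ("10.1000/a4", "Li Zhang", "0000-0003-0000-0003"),
]


def block_key(name):
    """Hypothetical blocking key: last name plus first initial, lowercased."""
    parts = name.replace(".", "").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}"


def build_blocks(recs):
    """Group records that share a blocking key (input for clustering-style AND)."""
    blocks = defaultdict(list)
    for doi, name, orcid in recs:
        blocks[block_key(name)].append((doi, name, orcid))
    return dict(blocks)


def build_pairs(blocks):
    """Label every within-block pair: 1 if the ORCID iDs match, otherwise 0."""
    pairs = []
    for members in blocks.values():
        for (doi_a, _, orcid_a), (doi_b, _, orcid_b) in combinations(members, 2):
            pairs.append((doi_a, doi_b, int(orcid_a == orcid_b)))
    return pairs


blocks = build_blocks(records)
pairs = build_pairs(blocks)
for pair in pairs:
    print(pair)
```

In this sketch, the blocks play the role of a clustering-oriented dataset and the labeled pairs play the role of a classification-oriented one. The ORCID iD serves as the ground-truth author identity, and the DOI identifies the publication.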
In building the datasets, the article also reveals the degree of last-name variation in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author names hosted in each database with the authors' official last names shown on their ORCID pages. By connecting the ORCID data to these three large literature databases, the rate of last-name variation was found to be 5.80%–9.59%. When variations produced by transliterating special characters into standard characters (e.g., “á” → “a”) are also counted, the rate rises to 8.04%–12.55%. These findings show that, besides different authors sharing the same name (homonyms), variation in an author's surname, or name variants (synonyms), is also an important source of author name ambiguity.
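The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in the article: the function names and the sample name pairs are hypothetical, and transliteration is approximated by Unicode decomposition, which handles accented letters such as “á” but leaves characters without a decomposition (for example “ø”) unchanged.

```python
import unicodedata


def strip_diacritics(name):
    """Approximate transliteration: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_last_name(db_name, orcid_name):
    """Compare a database last name with the ORCID last name.

    Returns "same", "transliteration" (differs only by diacritics), or "other".
    """
    a = db_name.strip().casefold()
    b = orcid_name.strip().casefold()
    if a == b:
        return "same"
    if strip_diacritics(a) == strip_diacritics(b):
        return "transliteration"
    return "other"


def variation_rates(pairs):
    """Return (rate excluding transliteration-only cases, rate including them)."""
    labels = [classify_last_name(db, orcid) for db, orcid in pairs]
    total = len(labels)
    other = labels.count("other")
    translit = labels.count("transliteration")
    return other / total, (other + translit) / total


# Invented (database last name, ORCID last name) pairs for illustration only.
sample = [
    ("Zhang", "Zhang"),
    ("Gonzalez", "González"),
    ("Smith-Jones", "Smith"),
    ("Lu", "Lu"),
]
strict_rate, inclusive_rate = variation_rates(sample)
print(strict_rate, inclusive_rate)
```

Reporting both rates mirrors the two ranges given above: one that leaves out differences explained purely by transliteration and one that counts them as variation.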
Finally, based on the LAGOS-AND datasets, the article also builds disambiguation methods. The evaluation results show that incorporating a semantic relatedness feature of citations boosts disambiguation performance. The article also finds that the accuracy of the author IDs in the MAG database is relatively low, especially in terms of recall. Based on this finding, it suggests that MAG author IDs should be used cautiously in future research.
The LAGOS-AND datasets are available at https://zenodo.org/record/7313380. At present, LAGOS-AND has two official versions: Version 1.0, built on MAG in 2019, and Version 2.0, built on OpenAlex in 2022. Since its publication, the LAGOS-AND dataset has attracted attention in the academic community. As of September 2023, it has been viewed more than 1000 times and downloaded more than 180 times.
The article is available at https://doi.org/10.1002/asi.24720