RefB: On the quest of references for collaborative knowledge resources

Houcemeddine Turki
Published in MisinfoCon
Dec 9, 2020


In July 2020, I received a WikiCred grant to create a bot that adds references from PubMed Central to biomedical statements in Wikidata based on the bibliographic metadata of scholarly publications. After four months of work, I found that this method can return a highly precise dataset of links between biomedical relations and references, but that it leaves a substantial number of accurate statements without sources. These findings motivate the use of bibliographic metadata, coupled with other resources and techniques, to drive reference assignment algorithms for Wikidata, Wikipedia and other collaborative websites, while discouraging full reliance on bibliographic information for this purpose.

Introduction

In the information age, detailed information about a broad range of areas is becoming more findable, accessible and reusable thanks to the Internet. Wikis are among the main resources enabling this large-scale, multidisciplinary availability of data, because they allow the collaborative creation, editing and illustration of pages within a given website. Thanks to their flexible software infrastructure and the involvement of many active contributors and computer engineering efforts, wikis have grown considerably in size and coverage.

However, this sharp increase in the data provided by wikis, particularly Wikipedia and Wikidata, comes with a critical concern about the trustworthiness and reliability of the information they provide, as these projects are subject to biased editing and vandalism on a daily basis. That is why wiki-based knowledge resources such as Wikipedia and Wikidata require references to support each statement on a given page [1].

Despite this policy, most current pages in Wikipedia and its sister projects lack references, resulting in a huge number of poor-quality pages. Currently, 3,561,924 (52%) of the 6,850,704 articles in English Wikipedia are Stub-Class articles suffering from a significant lack of detail and sources [2]. The same holds for Wikidata, where a significant number of statements still lack references despite the human effort devoted to solving this issue and improving the relevance of the knowledge base [3]. This situation should urge the Wikipedia and Wikidata communities to develop automated tools that add references to statements in collaborative projects so that they can be easily verified.

RefB: adding references to biomedical statements in Wikidata

Nowadays, bibliographic databases such as Google Scholar, Scopus and PubMed provide metadata about a large set of scholarly publications. For each research paper, this metadata includes the title of the work, its abstract, its keywords, its authors, its source title, and so on [4]. As this kind of information is representative of the findings and outcomes of a given research publication, and as research publications cover multiple domains including the arts, humanities, medicine and engineering, bibliographic databases encompass a significant portion of human knowledge and can consequently be used to find references for unsupported statements in collaborative knowledge resources. In this context, I decided to use the metadata available in PubMed Central, an open-access database of biomedical and life sciences literature, to find references for biomedical Wikidata statements lacking a source [5].

In July 2020, I began working on a bot called “RefB” for this purpose. I implemented the bot in Python 3.5 and requested permission to run it on Wikidata on July 29th. However, I had to wait until September 2nd for my bot proposal to be approved by the Wikidata community, and consequently to test the effectiveness of the idea [6].

After the RefB bot request was approved, I applied the program to a subset of the biomedical knowledge in Wikidata. The bot successfully extracted an exhaustive list of unsupported biomedical statements through the Wikidata Query Service, which allows SPARQL queries to retrieve information matching a given pattern. The retrieved information includes the labels of the subject, property and object of each unsupported statement (e.g. subject: “spondylosis”, property: “health specialty”, object: “orthopedics”).
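
To give an idea of this extraction step, here is a minimal SPARQL sketch of the kind of query that can be run on the Wikidata Query Service. It is an illustration rather than the exact query used by RefB, and it restricts itself to the “health specialty” property (P1995) from the example above:

SELECT ?item ?itemLabel ?specialty ?specialtyLabel WHERE {
  ?item p:P1995 ?statement .                                  # "health specialty" claims
  ?statement ps:P1995 ?specialty .                            # the object of each claim
  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref . } # keep only unreferenced claims
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100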

I restricted the extraction of unsupported statements to relational ones having a Wikidata item as an object, because relational statements are easier to verify in raw text than non-relational statements, which require advanced NLP techniques and machine learning to identify object values. Then, I searched PubMed Central using the Biopython library to find scholarly papers mentioning the subject, property and object of each statement. I limited the search to the three most relevant results and checked whether the subject and object of the statement both appear in the abstract of each of these publications, in order to verify whether they really are references confirming the statement. After this automatic validation, the references are added to Wikidata through the QuickStatements API, which allows the automatic mass upload of structured information to Wikidata [5, 7].
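
The sketch below illustrates this search-and-check step under a few assumptions of mine: it uses Biopython’s Entrez module to query PubMed Central, keeps the top three hits, and accepts a hit only if the fetched record mentions both the subject and the object. The helper name, e-mail placeholder and query shape are illustrative, not the exact RefB code, and RefB checks the abstract specifically rather than the whole fetched record:

from Bio import Entrez

Entrez.email = "user@example.org"  # placeholder; NCBI requires a contact e-mail

def candidate_references(subject, prop, obj, max_hits=3):
    """Return PMC IDs of papers that may support the statement (illustrative sketch)."""
    query = '"{}" AND "{}" AND "{}"'.format(subject, prop, obj)
    handle = Entrez.esearch(db="pmc", term=query, retmax=max_hits)
    result = Entrez.read(handle)
    handle.close()
    accepted = []
    for pmcid in result["IdList"]:
        handle = Entrez.efetch(db="pmc", id=pmcid, retmode="xml")
        record = handle.read()
        handle.close()
        if isinstance(record, bytes):
            record = record.decode("utf-8", errors="ignore")
        # Simplified check: keep the paper only if both terms appear in the record.
        if subject.lower() in record.lower() and obj.lower() in record.lower():
            accepted.append(pmcid)
    return accepted

print(candidate_references("spondylosis", "health specialty", "orthopedics"))

Validated statement-reference pairs can then be written out in QuickStatements batch syntax, where a source is attached by restating the claim followed by an S-prefixed source property, roughly of the form Q<subject>|P1995|Q<object>|S248|Q<paper item>, with S248 standing for the “stated in” property.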

Results and Discussion

Running several times between September 2nd and November 3rd, the bot screened 5,384 statements on Wikidata. Of these analyzed relations, only 451 statements (8.3% of the assessed biomedical statements) were assigned a total of 842 references (1.9 references per statement) [8]. When verifying the unmodified statements by hand (4,933 relations), I found that 47.4% of them are accurate.

This demonstrates the limited recall of the algorithm: roughly 2,300 of the unmodified statements (47.4% of 4,933) are accurate yet could not be sourced, meaning that only about one in six verifiable statements received a reference. It also raises concerns about the efficiency of using bibliographic metadata alone, without fully screening biomedical texts, to identify references for biomedical statements. The screening process itself is slow, assessing only 70 unsupported relations per hour, which does not significantly exceed the speed at which humans add references to statements. The algorithm is therefore expensive in terms of computing time and power consumption and should be enhanced to improve its overall speed, which could be a challenging task due to the timeout limits of the PubMed Entrez API and the Wikidata API.
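
As a small illustration of how such timeouts might be handled (my own sketch, not part of RefB), a failing Entrez or Wikidata call can simply be retried a few times with an increasing delay:

import time

def with_retries(call, attempts=3, base_delay=2.0):
    """Retry a callable on failure, waiting 2 s, 4 s, ... between attempts."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Illustrative use: with_retries(lambda: Entrez.esearch(db="pmc", term=query))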

Despite these limitations, a detailed analysis of the 842 references added by RefB shows that 97.3% of these sources are accurately assigned to their corresponding statements. This confirms the high precision of an algorithm based entirely on bibliographic metadata for adding references to biomedical statements.

This suggests that bibliometric-enhanced information retrieval methods can be useful for creating a basic, reliable dataset of links between biomedical relations and their references. This dataset can later be expanded with promising NLP-supported machine learning techniques applied to the full texts of scholarly publications (e.g. language models, Latent Dirichlet Allocation and deep learning) to identify precise references for the relations missed by the RefB algorithm. It also confirms that bibliographic metadata can be used, alongside knowledge graphs such as those in the Linked Open Data Cloud, to drive semi-automatic tools for enriching Wikidata statements with references, such as the Reference Hunt Game [9].

Conclusion and Future Directions

In this research project, I assessed how effectively bibliographic metadata can be used to add references to statements in collaborative resources like Wikidata. Despite the limited recall and speed of such an algorithm, its exceptional precision makes it a promising building block for other reference assignment initiatives, such as machine learning techniques and semi-automatic tools. In the next few years, I will therefore try to make this algorithm more efficient and to adapt it, using NLP techniques such as semantic similarity measures and word embeddings, to support statements needing references in Wikipedia.

Acknowledgements

I would like to thank Dr. Mohamed Ali Hadj Taieb and Dr. Mohamed Ben Aouicha from the Data Engineering and Semantics Research Unit, University of Sfax, Sfax, Tunisia, and Mr. Ahmed Medien from the WikiCred Grants Initiative for their useful comments and discussions about this research project.

The WikiCred Grants Initiative supports research, software projects and Wikimedia events that explore information reliability and credibility. WikiCred’s funding is provided by Craig Newmark Philanthropies, Facebook, and Microsoft.

References

[1] https://en.wikipedia.org/wiki/Wikipedia:Citing_sources
[2] https://en.wikipedia.org/wiki/Wikipedia:Statistics
[3] Turki, H., Shafee, T., Hadj Taieb, M. A., Ben Aouicha, M., Vrandečić, D., Das, D., & Hamdi, H. (2019). Wikidata: A large-scale collaborative ontological medical database. Journal of Biomedical Informatics, 99, 103292.
[4] Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338–342.
[5] https://misinfocon.com/refdata-adding-trustworthiness-to-wikidata-d3cc68c21a6f
[6] https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/RefB_(WikiCred)
[7] https://github.com/Data-Engineering-and-Semantics/refb/
[8] https://xtools.wmflabs.org/ec/www.wikidata.org/RefB%20%28WikiCred%29
[9] https://github.com/wmde/reference-island
