Ryan Muther, David Smith, and Sarah Savant
Time and Place: Thursday, 01.07., 15:25–15:45, Room 1
Session: Co-authorship and Citations
Keywords: computer science; digital humanities; NLP; network inference; entity linking
Understanding the role of individuals in the transmission of historical knowledge is key to gaining deeper insight into the evolution of the written tradition over time. In classical Arabic, authors of religious and historical texts often give their sources in the form of an isnad, or chain of transmission. Unlike citations of individual authors and works conventional in scholarship today, isnads provide fuller provenance by listing the transmitters of a piece of information or anecdote going back to a reputable authority, often the Prophet or a notable scholar. Taken individually, most isnads are a textual representation of a single chain of individuals; collectively, a group of isnads represents an interconnected network of scholars, which can be analyzed by historians to gain a broader understanding of the social dimension of the process of textual production. Due to the complexity of this style of presenting evidence, most prior work on isnads has focused on collections of hadith, which have a predictable structure amenable to processing with regular expressions and other lexical features (Altammami et al., 2019). In this paper, however, we focus on extracting scholarly networks from more open-ended historical texts.
These textual representations of information provenance are difficult to use as-is due to the ambiguity often present in the names used to refer to individuals. An individual may be referred to by several different names, and multiple distinct individuals can be referred to by the same name. To resolve this ambiguity, each name needs to be assigned the individual to which it refers. In ambiguous cases, information from the surrounding context in which a particular mention occurs can be used to infer which individual it refers to. Most open-domain NLP systems for name disambiguation and entity linking, however, rely on matching the text near a name mention with a discursive description of an entity in a small number of broad-coverage resources such as Wikipedia (Durrett and Klein, 2014; Mueller and Durrett, 2018). For isnad texts, this standard approach is insufficient since (1) many names mentioned in isnads do not correspond to Wikipedia articles or to the standard forms mentioned in historical biographical dictionaries, and (2) the context for a name in an isnad consists almost entirely of a list of other names, rather than discursive text.
Instead, we represent the relationships among name mentions in isnad networks using the contextual embeddings of tokens inferred by a transformer language model. Starting with a masked language model trained on the Arabic Gigaword corpus, we fine-tune the model's parameters to predict full name mentions that have been identified by an Arabic named entity recognition model and masked out. (The baseline NER system achieves 96% F1 on names in isnads, while the masked name prediction model predicts the correct name with 81% accuracy.) Since the contextual embeddings allow us to distinguish identical surface forms in different contexts, we can find the k nearest neighbors for each name mention and then apply community-detection algorithms to infer clusters of mentions referring to the same individual.
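The neighbor-graph clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hand-written 2D vectors stand in for the transformer's contextual embeddings, and connected components of the mutual-kNN graph stand in for a full community-detection algorithm.

```python
from collections import defaultdict
import math

# Synthetic "contextual embeddings" for six mentions sharing one surface
# form; in the actual pipeline these would come from the fine-tuned masked
# language model. Two underlying individuals, hence two expected clusters.
mentions = {
    "m1": [1.0, 0.1], "m2": [0.9, 0.2], "m3": [1.1, 0.0],   # individual A
    "m4": [0.1, 1.0], "m5": [0.0, 0.9], "m6": [0.2, 1.1],   # individual B
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def knn_edges(vectors, k=2):
    """Link each mention to its k most similar neighbors."""
    edges = set()
    for m, vec in vectors.items():
        sims = sorted(
            ((cosine(vec, other), n) for n, other in vectors.items() if n != m),
            reverse=True,
        )
        for _, n in sims[:k]:
            edges.add(tuple(sorted((m, n))))
    return edges

def cluster(vectors, k=2):
    """Group mentions via connected components of the kNN graph
    (a simple stand-in for community detection)."""
    adj = defaultdict(set)
    for a, b in knn_edges(vectors, k):
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for m in vectors:
        if m in seen:
            continue
        stack, comp = [m], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

clusters = cluster(mentions, k=2)
# Mentions of the two individuals end up in separate clusters.
```

In practice one would substitute modularity-based community detection (e.g., from a graph library) for the connected-components step, since kNN graphs over real embeddings are rarely as cleanly separated as this toy example.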
To evaluate this approach, we use a set of 2,381 isnads taken from Ibn ‘Asakir’s 12th-century History of Damascus (Tarikh Dimashq), each of which has been manually disambiguated by a domain expert to create gold-standard data for training and evaluation. This yields a total of 14,455 mentions, of which 13,072 have been linked to known individuals; the remaining 1,383 mentions were too ambiguous for the expert to determine which individual is being referenced.
Using the disambiguated names, we can begin to look more closely at related problems, such as inferring network structures from the text of isnads, inferring missing nodes in some chains of transmission, and visualizing the social networks involved in the production of Arabic texts to answer questions in book history.
Altammami, Shatha, Eric Atwell, and Ammar Alsalka. “Text Segmentation Using N-Grams to Annotate Hadith Corpus.” In Proceedings of the 3rd Workshop on Arabic Corpus Linguistics, 31–39. Cardiff, United Kingdom: Association for Computational Linguistics, 2019. https://www.aclweb.org/anthology/W19-5605.
Durrett, Greg, and Dan Klein. “A Joint Model for Entity Analysis: Coreference, Typing, and Linking.” Transactions of the Association for Computational Linguistics 2 (December 2014): 477–90. https://doi.org/10.1162/tacl_a_00197.
Mueller, David, and Greg Durrett. “Effective Use of Context in Noisy Entity Linking.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1024–29. Brussels, Belgium: Association for Computational Linguistics, 2018.