From Millions of Words to a Single Phrase: Examination of the TXM Software in Hebrew

1. Abstract

Following the technological developments of the second half of the 20th century, linguistic researchers began, for the first time, to analyze large-scale text corpora. One research methodology developed for lexicometry and statistical text analysis is textometry: the attempt to combine statistical analysis techniques, such as factorial correspondence analysis (Benzécri, 1977) and hierarchical ascendant classification (Ward Jr, 1963), with full-text search techniques, such as KWIC concordances (Luhn, 1960), in order to trace the precise original editorial context of any textual event participating in the analysis.
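As a rough illustration of what a KWIC (key word in context) concordance does, the following Python sketch lists every occurrence of a keyword together with a few tokens of left and right context. The function name `kwic`, the `width` parameter and the whitespace tokenization are assumptions made for this example; it is not the implementation used in TXM.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) tuple per hit.

    `width` is the number of tokens kept on each side (an illustrative choice).
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy input: naive whitespace tokenization of a single sentence.
sample = "they have the media we have you".split()
for left, key, right in kwic(sample, "media", width=2):
    print(f"{left:>12} | {key} | {right}")
```

Keeping the position of each hit is what allows any statistical result to be traced back to the passages that produced it.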

The TXM software (Heiden, 2010), which was developed in France as a modular platform for a new generation of textometric research and which I adapted to Hebrew, makes it possible to analyze a large corpus of texts, such as the one in my research, using tools and methods based on linguistics and discourse analysis: decomposing the text into factors and elements, carrying out statistical analysis, identifying hidden social patterns, and then restoring the corpus to its original form.

I am a Ph.D. student in the Social Sciences, and my research focuses on image repair theory (Benoit, 2015): an ensemble of strategies, such as evading responsibility and reducing offensiveness, used by individuals, organizations and groups to repair their image in times of crisis. It examines how rhetorical measures are used in online verbal exchanges among social network users who attempt to repair their personal image. To this end, I use the TXM software to analyze a corpus of more than eight million words, comprising 365 Facebook posts published by Israeli Prime Minister Benjamin Netanyahu during the ongoing affairs involving him and more than 285,000 comments made by users. Netanyahu's affairs are four police investigations in which he is involved as a suspect or has given testimony.

During the poster session, I will present a review of some of the tools I used in the TXM software during the digitized analysis process in my Ph.D. research, and the way they allowed me to identify the following key phrase used by Netanyahu: "They have the media, we have you". Among these tools are Progression, which is the frequency of occurrence of one term throughout the text corpus; Co-occurrence, which is the frequency with which two terms occur alongside each other, in a certain order, in a text corpus; and Specificity, which is the score a term receives based on its frequency in a given part of the corpus relative to its frequency in the entire corpus, and which indicates whether the term is overused, underused or useless in that part.
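The three measures above can be sketched on a toy token list in Python. Everything in this sketch is an assumption made for illustration: the function names, the `window` parameter, the reading of Progression as a cumulative count, and the use of a hypergeometric model for Specificity (a common choice for specificity scores, which may differ from what TXM computes in detail). It is not TXM's implementation.

```python
from math import comb


def progression(tokens, term):
    """Cumulative number of occurrences of `term` at each token position."""
    total, curve = 0, []
    for tok in tokens:
        total += tok == term
        curve.append(total)
    return curve


def cooccurrence(tokens, first, second, window=3):
    """Count occurrences of `first` followed by `second` within `window` tokens."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1:i + 1 + window]:
            count += 1
    return count


def specificity(k, f, t, T):
    """Hypergeometric tail probabilities for a term in one part of a corpus.

    k: occurrences of the term in the part, f: occurrences in the whole corpus,
    t: tokens in the part, T: tokens in the whole corpus.
    Returns (p_over, p_under) = (P(X >= k), P(X <= k)); a small p_over suggests
    overuse in the part, a small p_under suggests underuse.
    """
    def pmf(x):
        return comb(f, x) * comb(T - f, t - x) / comb(T, t)

    lowest = max(0, t - (T - f))
    highest = min(f, t)
    p_over = sum(pmf(x) for x in range(k, highest + 1))
    p_under = sum(pmf(x) for x in range(lowest, k + 1))
    return p_over, p_under


# Toy corpus: the key phrase twice, followed by an invented reply.
tokens = ["they", "have", "the", "media", "we", "have", "you"] * 2 + ["we", "support", "you"]

print(progression(tokens, "media")[-1])     # total occurrences of "media"
print(cooccurrence(tokens, "media", "we"))  # order matters: "media" then "we"
print(cooccurrence(tokens, "we", "media"))  # reversed order
print(specificity(k=2, f=2, t=14, T=17))    # "media" in the first 14 tokens
```

In this sketch the order of the two arguments to `cooccurrence` matters, which mirrors the "certain order" in the definition above; on a real corpus the parts compared by `specificity` would be, for example, posts versus comments.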


Benoit, W. L. (2015). Accounts, excuses, and apologies: Image repair theory and research. Albany, New York: State University of New York Press.

Benzécri, J. P. (1977). Sur l'analyse des tableaux binaires associés à une correspondance multiple. Cahiers de l'Analyse des Données, 2(1), 55-71.

Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In R. Otoguro, K. Ishikawa, H. Umemoto, K. Yoshimoto, & Y. Harada (Eds.), 24th Pacific Asia Conference on Language, Information and Computation - PACLIC24 (pp. 389-398). Waseda University, Sendai, Japan: Institute for Digital Enhancement of Cognitive Development.

Luhn, H. P. (1960). Key word-in-context index for technical literature (KWIC index). American Documentation, 11(4), 288-295.

Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236-244.

Maxim Lengo, Bar-Ilan University, Israel
