We provide an R-based method for extracting the commentary sections of websites, whose contents can bias a corpus analysis but are also interesting to study in their own right.
Studying corpora of websites, through methods such as topic modelling or hyperlink analysis, is an increasingly adopted approach in the humanities (e.g. Severo et al., 2018; Romele et al., 2016; Berthelot et al., 2016), information science (e.g. Bounegru et al., 2017) and the social sciences (Marres, 2015; Froio, 2018). Yet one part of their content is very often neglected: the comments section.
The biases induced by leaving the comments section in
Especially when a corpus of websites focuses on controversial topics, comments sections can induce many biases in the analyses. Comments can express a point of view radically different from that of the page itself. Hyperlinks present in the comments can point to content that the owner of the website does not endorse, which can distort any network analysis. The vocabulary used in the comments can also bias content analyses such as topic modelling. It is thus key to eliminate these comments, or to set them aside for a separate analysis. We exemplify this through a case study.
Separating the comments from the page: a tedious task
Removing or extracting the comments sections from a set of websites is in fact a tedious task, and thus rarely performed. Many technologies can be used to produce the page: HTML 4 or 5, XHTML, Ajax, Ruby on Rails, etc. Some standards obviously exist, for instance for blog platforms, but they are not widely adopted. Unexpected ways of implementing a comments section (e.g. treating it as a subpart of a forum) also occur frequently.
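To give an idea of this heterogeneity, a few illustrative markers that often delimit comments sections are shown below; these are assumptions based on common platform defaults, not an exhaustive or guaranteed list, and themes frequently override them:

```html
<!-- Default WordPress themes typically wrap comments in: -->
<div id="comments" class="comments-area">...</div>
<!-- The Disqus embed injects the thread into a dedicated container: -->
<div id="disqus_thread">...</div>
<!-- Hand-coded pages may use any ad-hoc marker, or none at all: -->
<div class="reactions">...</div>
```

A pattern written against one platform's default markup will therefore miss customised themes and hand-coded pages, which is why per-site inspection remains necessary.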
Aiming at exhaustiveness: a necessity
Focusing only on the easily retrievable comments sections would induce important biases. The way the comments section is encoded is itself a socially determined phenomenon, reflecting the user's literacy in web programming, or their financial means. Excluding very poorly encoded pages, or virtuoso content written by expert programmers, could thus translate into excluding specific groups from any further analysis.
A method for extracting comments
The method we propose is not fully automated: it requires a manual identification, in the code, of the patterns delimiting comments sections and individual comments. Some patterns are relevant for many websites, while others need to be carefully designed for a single site. We then provide an R implementation that carries out the rest of the procedure: after automated quality checks and potential improvements, the links and contents coming from comments are subtracted, and the comment-free pages can be analysed. The comments sections themselves can be extracted for a separate analysis.
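As an illustration of the pattern-based step, the sketch below is a minimal assumption of ours, not the authors' actual implementation; the function names and delimiting patterns are hypothetical. It splits a page into its comment-free body and its comments section using a pair of delimiting regular expressions, then collects the hyperlinks found in the comments:

```r
# Minimal sketch in base R; the delimiting patterns are illustrative
# and would need to be designed per site, as described above.

# Split a page's HTML into its comment-free body and its comments
# section, delimited by an opening and a closing pattern.
split_comments <- function(html, open, close) {
  rx <- paste0("(?s)", open, ".*?", close)  # (?s): '.' also matches newlines
  found <- regmatches(html, regexpr(rx, html, perl = TRUE))
  list(
    page     = sub(rx, "", html, perl = TRUE),  # body with comments removed
    comments = if (length(found)) found else ""
  )
}

# Collect the targets of hyperlinks appearing in a chunk of HTML,
# e.g. to subtract comment-borne links before a network analysis.
extract_links <- function(html) {
  hits <- regmatches(html, gregexpr('href="[^"]*"', html))[[1]]
  gsub('^href="|"$', "", hits)
}

# Example with a hypothetical WordPress-like marker:
page <- paste0(
  '<p>Post <a href="http://endorsed.org">source</a></p>',
  '<div id="comments"><p>Troll says: see ',
  '<a href="http://contested.org">this</a></p></div>'
)
parts <- split_comments(page, '<div[^>]*id="comments"[^>]*>', "</div>")
extract_links(parts$comments)  # "http://contested.org"
```

The comment-free body (`parts$page`) can then feed the topic-modelling or network analysis, while `parts$comments` is kept for a separate study.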
REFERENCES
BERTHELOT, Marie-Aimée, SEVERO, Marta, and KERGOSIEN, Eric. Cartographier les acteurs d'un territoire: une approche appliquée au patrimoine industriel textile du Nord-Pas-de-Calais, 2016.
BOUNEGRU, Liliana, VENTURINI, Tommaso, GRAY, Jonathan, et al. Narrating Networks: Exploring the affordances of networks as storytelling devices in journalism. Digital Journalism, 2017, vol. 5, no 6, p. 699-730.
FROIO, Caterina. Race, religion, or culture? Framing Islam between racism and neo-racism in the online network of the French far right. Perspectives on Politics, 2018, vol. 16, no 3, p. 696-709.
MARRES, Noortje. Why map issues? On controversy analysis as a digital method. Science, Technology, & Human Values, 2015, vol. 40, no 5, p. 655-686.
ROMELE, Alberto and SEVERO, Marta. From Philosopher to Network. Using Digital Traces for Understanding Paul Ricoeur's Legacy. Azimuth. Philosophical Coordinates in Modern and Contemporary Age, 2016, vol. 6, no 6.
SEVERO, Marta and VENTURINI, Tommaso. Intangible cultural heritage webs: Comparing national networks with digital methods. New Media & Society, 2016, vol. 18, no 8, p. 1616-1635.