Banner: Scraping Disboard Data

Scraping Disboard: A Methodology

By PS Berge

🐩 @theiceberge       📧 hello@psberge.com

Note: This is not a technical guide for installing my fork of the Disboard scraper. For that, please see the Google Colab below or view the full repository here!

Discord’s ‘Iceberg’ Effect

It’s kind of funny: when Dan approached me in Spring of 2020 to say that he was interested in studying political activist groups on Discord, I actually brushed him off. "It’s not possible. Not worth it," I had said. "IRB approval for anything Discord-related would be a nightmare, since so much is considered private. Not to mention, how will you find these communities?" At the time, Discord’s "Discovery" feature was only available to partnered and verified servers with over 10,000 members. But Dan had actually been looking into this. "I think I might have something to help with that," he told me.

As I described in our announcement post, one of the trickiest issues with studying Discord is the semi-private nature of communities. Because Discord servers 1) require invitations to join and 2) are unlikely to be indexed in Discord’s integrated search (unless they have over 1000 members and opt in to Discord’s Discovery features), most of the activity on Discord’s platform remains hidden. In our New Media & Society article, Dan and I call this the ‘iceberg effect’ of Discord: new users are confronted with an ‘illusory public’ of curated, approved servers. While this practice is optimal for reducing the exposure of new users to hate groups on the platform, it makes things difficult for researchers who are interested in communities that do not appear in the ‘curated layer.’ These include 1) any community smaller than 1000 members, 2) any community with NSFW content, and 3) hateful communities. For researchers, it’s important that we maintain tools that can account for Discord’s iceberg effect.

Dan’s solution, of course, was to look outside the platform itself, at the popular server bulletin site Disboard, which displays public community metadata. Data on Disboard is public. It can be scraped, analyzed, and, thanks to its tag-based indexing system, processed as networked data. This was precisely what we did in our NMS article, and our study circumnavigates the ‘iceberg effect’ by examining the irrefutably public metadata of Discord servers. Our solution is imperfect, but it is a step forward.

In the spirit of D/ARC’s mission to equip researchers (and my love for open-source), today I am publishing all of the code we used to collect data in "Mapping Discord’s Darkside" (with some small improvements). I am releasing it in two forms. The first is a Colab Notebook version of the scraper, which runs in Google Drive and requires 1) no coding experience and 2) minimal computer processing power. In the Google Colab, I’ve specifically set up the instructions to be friendly to newcomers with little Python experience. The Notebook is fully documented with tutorials, and even my mother, who has never seen Python in her life, can run it! 🐍 The second is the GitHub repository, which contains the full code for the scraper for easy modification by those familiar with Python.

It’s my hope that giving scholars a way to quickly examine networks of servers through Disboard will empower further research into Discord communities. But in this post, I want to share some of our reasoning for how and why one might scrape Disboard in the first place, how we have gone about it, and what further interests this tool might serve for aspiring Discord scholars.


A banner that reads: Disboard Scraper & Analysis Notebook By PS Berge. Part of the Discord Academic Research Community.

Disboard 101

I’ll begin with the basics: Disboard is a public bulletin site for Discord servers. It is not affiliated with (though it loves to emulate) Discord. It gets listings from servers which have installed Disboard’s bot (a third-party program that users invite to their servers) and shares the following data about each community:

  • The server name.
  • A description.
  • The number of members currently online.
  • An invite link to the server.
  • (Up to) five descriptive tags, chosen by the lister.

The scraper we use for Disboard is a fork of a repository originally coded by DiscordFederation (specifically, daegontaven) on GitHub. My iteration is largely reliant on the original code; I’m immensely grateful to daegontaven for providing this repository with the generous MIT 3.0 license. That said, my scraper varies in a few key ways, namely by 1) collecting descriptions, 2) using a sleep-timer to allow large-scale scraping without becoming rate-limited, and 3) adding some basic HTML cleanup features. When you run the scraper, you simply input a search term, and the scraper will collect every server that includes that token as one of its five tags on Disboard (note that this process is tag-based, not name-based). To find tags to scrape, simply search Disboard and explore which tags look relevant. In designing your study, you should leave room for adding tags later, as you will almost certainly encounter co-occurring tags that you will want to add to your sample.
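To make the sleep-timer and cleanup ideas concrete, here is a minimal sketch. It is not the code from the repository: `fetch_page` is a hypothetical function you would supply yourself (given a tag and a page number, it returns a list of raw description strings, or an empty list when the results run out), and the two-second delay and 50-page cap are arbitrary assumptions.

```python
import html
import re
import time

TAG_RE = re.compile(r"<[^>]+>")


def clean_html(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)   # remove tags first...
    text = html.unescape(text)    # ...then decode entities such as &amp;
    return " ".join(text.split())


def scrape_tag(tag, fetch_page, delay=2.0, max_pages=50, sleep=time.sleep):
    """Collect cleaned descriptions for one tag, pausing between pages.

    fetch_page is a hypothetical callable: fetch_page(tag, page) -> list of str.
    """
    results = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(tag, page)
        if not rows:              # an empty page means there is nothing left
            break
        results.extend(clean_html(r) for r in rows)
        sleep(delay)              # the sleep-timer: wait before the next request
    return results
```

Passing `sleep` in as a parameter is only a convenience for trying the loop out without actually waiting; in normal use you would leave it at its default.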

A network map of servers, visualized using the Google Colab notebook provided.
A visualization of tags linked to ‘toxic’ Discord servers, scraped from Disboard.

Scraping Disboard provides brief previews of how Discord communities are publicly marketing themselves. Note that this method does not scrape any information from within these servers! However, by looking at these public names, descriptions, and tags, one is able to:

  • Break through the ‘iceberg effect’ of Discord, by examining servers not otherwise searchable (smaller servers, servers that don’t want to be listed on Discord, and servers that failed the safety requirements of Discovery).

  • See how servers are attracting new users. These descriptions generally focus on recruiting new members to the community—and, as we show in our study, there is much to examine about how these spaces are marketing themselves to new users.

  • View Discord as a networked system. By examining these tags, we can quickly get a sense of how Discord’s broader ecology is networked. Because users select the tags for their networks, they are choosing to present their community under certain search terms. (See above for an example of a network visualization of co-occurring tags from >4000 “toxic” servers).
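As a rough illustration of the networked view described above, co-occurring tags can be counted as weighted edges. This is a sketch under stated assumptions rather than the notebook's own code: each server is represented simply as a list of its tags, and the sample tags are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(servers):
    """Count how often each pair of tags appears on the same server.

    servers: an iterable of tag lists, one list per server (assumed format).
    """
    edges = Counter()
    for tags in servers:
        # Normalise case and drop blanks/duplicates so "Gaming" == "gaming".
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Invented sample data, for illustration only.
sample = [
    ["gaming", "anime", "chill"],
    ["Gaming", "anime"],
    ["chill", "music"],
]
edges = cooccurrence_edges(sample)
```

Each key in `edges` is a pair of tags and each value is the number of servers listing both, which is the kind of weighted edge list that network visualization tools generally accept.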

Again, this isn’t perfect: Disboard lists just over 1.2 million Discord servers (of 19 million on the platform). Yet scraping Disboard provides a more robust sampling of communities than anything within Discord’s own systems. Notably, Disboard is not the only bulletin site worth examining for such inquiries into Discord: Top.gg, discordservers.com, and even Discord’s own Discovery all provide partial-but-useful glimpses into the broader practices of communities.

A screenshot of Disboard's interface, showing a search for gaming communities with over 189 thousand results.
An example tag-search on Disboard.

Working With Disboard Data

Once you’ve collected data from a number of tags on Disboard, you’ll have anywhere between one and several dozen CSV files with server data. Making use of such data will depend on the object of your study. If you are a social scientist, digital humanist, or internet researcher interested in exploring relationships between servers, the discourse of certain communities, or activity levels over time, you will likely want to borrow from some of our methods directly. Using your data, you can:

  • Track relationships between different tag-clusters using Python.
  • Code and analyze server descriptions using close and distant reading.
  • Quantitatively explore the popularity of tags and terms, and examine how they are being used in context.
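For the quantitative side, a first pass at "distant reading" can be as simple as counting terms across descriptions. The sketch below is only illustrative: the stopword list is a tiny placeholder, the tokenizer is deliberately crude, and the sample descriptions are invented.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real study would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "is", "we", "our", "you"}


def term_counts(descriptions):
    """Count lowercase word tokens across descriptions, skipping stopwords."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Invented sample descriptions, for illustration only.
descriptions = [
    "Join our friendly gaming community!",
    "A gaming server for friendly people",
]
counts = term_counts(descriptions)
```

Calling `counts.most_common(10)` then gives the ten most frequent terms, which is a reasonable starting point before reading descriptions closely.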

Such approaches allow us to assess relationships between the discursive elements of these communities. Such work opens multiple considerations for how power (in the cyberfeminist sense) is being distributed across the network. As our study has shown, despite the alleged privacy of Discord communities, hate groups readily mobilize against marginalized communities on these networks. Bringing in the additional, networked context of Disboard data should be done with careful consideration for how technology and discourse shape and are shaped by power dynamics. If you’re looking to develop such research, here are a few key recommendations for crucial literature in this area:

  • Dr. André Brock, Jr.’s award-winning book Distributed Blackness: African American Cybercultures and his important article on Critical Technoculture Discourse Analysis. Brock’s framework for approaching cyberculture focuses on bringing both technology and contextualizing discourse in conversation with one another. While Disboard data may provide useful context for research, it is important that we qualify that data with other cultural and technological forces.

  • Data Feminism by Catherine D’Ignazio & Lauren Klein (2020). Data Feminism not only clarifies important practices for researchers, but also centers the dynamics of power and hegemony in social media networks. Understanding cyberfeminist approaches to technology is important in thinking about the complexity of Discord communities and networks.

In addition to theoretical tools, there are a few technical tools that I highly recommend to support your analysis of Discord data, especially in high volumes. (A planet-sized thank-you to Dr. Anastasia Salter and the UCF Texts & Technology faculty for teaching me everything I know about these tools!):

Python

Python is going to be the most effective tool for quickly examining your Disboard data. If you’re unfamiliar with Python, or want a few quick tools to get started, the notebook-version of the scraper includes many of my go-to functions for looking at data from Disboard CSV files. Note that even if you’ve scraped your data elsewhere (such as with the GitHub repository version), you can still analyze your data using the Colab notebook. Alternatively, you can download the notebook and run it and modify it locally.
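Because the same server can turn up under several tags, one early step is usually merging the per-tag CSV files and dropping duplicates. Here is a small sketch of that step. The column names (`name`, `invite`, `tags`) are assumptions for illustration and may not match the scraper's actual output, and in-memory strings stand in for real files.

```python
import csv
import io


def merge_rows(readers, key="invite"):
    """Merge rows from several CSV readers, keeping the first row per key.

    key is the column assumed to identify a server uniquely.
    """
    merged = {}
    for reader in readers:
        for row in reader:
            merged.setdefault(row[key], row)
    return list(merged.values())


# Two in-memory "files" stand in for per-tag CSV exports (invented data).
gaming_csv = io.StringIO("name,invite,tags\nPixel Pals,abc123,gaming;chill\n")
anime_csv = io.StringIO(
    "name,invite,tags\nPixel Pals,abc123,gaming;chill\nOtaku Hub,xyz789,anime\n"
)
servers = merge_rows([csv.DictReader(gaming_csv), csv.DictReader(anime_csv)])
```

With real files, each reader would instead wrap a file opened with `open(path, newline="", encoding="utf-8")`.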

Orange Data Mining Tools

Orange includes a robust toolset for examining data, including text. You’ll want to install the ‘text tools’ plugin, which provides many of the same functionalities as Python, but with no coding experience required. There are many tutorials on Orange, but I recommend starting with this wonderful guide from an NEH course on Understanding Digital Culture put together by some of the rockstar faculty at my own UCF’s T&T program.

Discord-Scraper by Dracovian

This scraper allows you to pull messages from Discord text channels (and even direct messages) and store them. Note that such scraping will usually require approval from a Review Board, but for ethnographic and even interview-based studies this can be enormously helpful.

Note that, technically, performing automated tasks from a personal Discord account violates Discord’s terms and conditions. Ideally, this is done by setting up a bot and collecting messages that way.

AntConc Concordance Tools

AntConc is like CTRL+F with a nuclear jetpack. It allows you to quickly search through your dataset, find co-occurring words and tokens, and build concordances, and it works with high volumes of data (even dozens of different files, so you can keep your tag-based data separate). In our study, I used AntConc to examine co-occurring tags and expand (and narrow) our sample by identifying which tags were relevant to hate.
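If you have not seen a concordance before, the core idea (a keyword shown with a few words of context on each side) can be sketched in a few lines of Python. This is a toy illustration of the concept, not a description of how AntConc works internally, and the sample texts are invented.

```python
import re


def kwic(texts, keyword, width=3):
    """Return (left context, match, right context) tuples for a keyword."""
    target = keyword.lower()
    lines = []
    for text in texts:
        tokens = re.findall(r"\S+", text)   # split on whitespace
        for i, tok in enumerate(tokens):
            # Compare case-insensitively, ignoring surrounding punctuation.
            if tok.lower().strip(".,!?;:\"'()") == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, tok, right))
    return lines


# Invented sample descriptions, for illustration only.
hits = kwic(
    ["Join our toxic gaming community today!", "Not a Toxic server."],
    "toxic",
    width=2,
)
```

Each tuple in `hits` is one concordance line, so printing them in aligned columns gives a quick view of how a term is being used across descriptions.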

You can watch my (admittedly very goofy) tutorial on using AntConc for digital humanist work here:

Other Approaches

Although the tools described above are great for examining the public discourse and networked dynamics of Discord communities, I hope these tools can bolster other approaches as well. I’m aware, for example, that this work has significant implications for scholars focusing on extremism, hate, and predatory behavior online (I was surprised to find several dozen criminology scholars in my Twitter follows after the announcement of our work in NMS).

For those interested in behavior and activity within Discord communities, these methods can be useful for providing important context and identifying ideal spaces for research. For digital ethnographers, Disboard provides a way to quickly scout communities for potential study. Using the tools in the notebook, one can quickly find:

  • The most popular communities using a certain tag (sorting by active member count).
  • Commonly co-occurring tags, allowing one to easily snowball and find other relevant samples.
  • Public-facing interest communities, which may be more receptive to interviews or other kinds of study.
  • Whether or not a community is a part of a network of same-name or associated servers (such as the white-supremacist recruitment networks we identified in our study).
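Two of these lookups can be sketched briefly: ranking servers under a tag by members online, and flagging servers that share a name. The field names (`name`, `invite`, `online`, `tags`) and the sample records are assumptions made for illustration, not the notebook's actual schema.

```python
from collections import defaultdict


def top_by_members(servers, tag, n=5):
    """Return up to n servers carrying the tag, most members online first."""
    tagged = [s for s in servers if tag in s["tags"]]
    return sorted(tagged, key=lambda s: s["online"], reverse=True)[:n]


def same_name_groups(servers):
    """Group invite codes by normalised server name; keep names used twice or more."""
    groups = defaultdict(list)
    for s in servers:
        normalised = " ".join(s["name"].lower().split())
        groups[normalised].append(s["invite"])
    return {name: invites for name, invites in groups.items() if len(invites) > 1}


# Invented sample records, for illustration only.
servers = [
    {"name": "Cozy Gamers", "invite": "aaa", "online": 120, "tags": ["gaming", "chill"]},
    {"name": "cozy  gamers", "invite": "bbb", "online": 45, "tags": ["gaming"]},
    {"name": "Study Lounge", "invite": "ccc", "online": 300, "tags": ["study", "chill"]},
]
top_gaming = top_by_members(servers, "gaming")
networks = same_name_groups(servers)
```

Exact name matching is a blunt instrument; associated servers often vary their names slightly, so treat the result as a list of leads to check by hand.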

I hope these tools will bolster more qualitative, ethnographic, and deep-dive approaches to Discord communities. That said, although there remains far too little ethnographic study of Discord and Discord users, we are not without precedents here (and I encourage you to check out the D/ARC Zotero library for updates!). In particular, there are two important studies in this vein that I want to spotlight:

  • D/ARC moderator Nick-Brie Guarriello’s article "Not Going Viral: Amateur Livestreamers, Volunteerism, and Privacy on Discord" (2021) examines, through embedded study, gaming livestream communities on Discord, and the precarity of copyright and privacy for Discord gaming communities. She also highlights the impact of COVID-19 on Discord spaces as well as intersections between gaming and Discord culture.

  • Jiang et al.’s study "Moderation Challenges in Voice-based Online Communities on Discord" (2019) remains one of the most exhaustive interview-based studies of user activity and behavior within Discord communities. The authors interviewed 25 different moderators and write extensively about the challenges of moderation and Discord’s technical affordances (bots, roles, etc.).

Join D/ARC!

A big part of changing the landscape of Discord research is community. We hope this can be a space to build and share together. Note that D/ARC is open to all researchers, especially students, independent scholars, and junior faculty. To join the server, simply click the link below (or any of the other million links on this site; I’ve put it almost everywhere!).

And if that’s not incentive enough, I’m going to be giving a live workshop on working with Disboard data for D/ARC members on Friday, February 4th at 7:00PM Eastern Time (GMT-5:00)!

Banner that reads: Join Us! Community in D/ARC Times. The Discord Academic Research Community.

Want Additional Help With the Scraper?

The best way to get help with the Scraper is to join the D/ARC server and post in the ❓-technical-questions channel. That said, if you or your research team need additional support in setting up tools, or want me to run a guided workshop for working with Disboard data using the notebook, please don’t hesitate to contact me (@theiceberge or hello@psberge.com).

Updates and Disclaimer

This post was last updated on 12.23.2021 by PS Berge.

Note that I am providing these tools to empower Discord researchers in the academic community. I am in no way taking responsibility for the things you do with these tools! As stated in the documentation of the notebook, I expect that you will not use our tools or methods to harass others, re-entrench hegemonic systems, or perform unethical research practices. Although these methods help circumnavigate deprecated limitations by research boards, that does not mean your research should not be rigorously guided by an ethical framework. If you are unfamiliar with designing ethical studies for internet research, a good place to start is the Association of Internet Researchers Ethics Guidelines.