New Tool Allows Easy Data Collection from 4chan

Published on Mon Jul 17 2023

Image: 4chan wikipedia spoof | kevin on Flickr

Researchers studying the online imageboard 4chan now have a new tool at their disposal. Written in Python, the 4chan Text Collection Tool (4TCT) uses the 4chan API to collect textual data and metadata from the site, giving academics an efficient way to gather data for sociological research. The tool has been released publicly on GitHub.

4chan has attracted sociological research because of its involvement in political and cultural events and movements such as Gamergate, Pizzagate, and the Trump presidency. It has also been studied for its far-right content and for the use of memes within these movements. Computer scientists, in turn, have used 4chan as a source of data for training models to detect hate speech. 4TCT aims to make it easier for researchers to analyze 4chan and these topics by providing a streamlined method for collecting relevant data.

The tool is designed to be user-friendly and requires only basic knowledge of Python or Docker. It ships with several usage examples. It does not include data-processing or loading tools, but researchers can customize the Docker runtime by editing the code.

The algorithm behind 4TCT allows researchers to monitor specific boards or the entire site. It checks for new, active, and 'dead' threads on the monitored boards, collecting and storing the data in JSON files. The tool then repeats the process, ensuring researchers have access to the most recent updates in their data collection.
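The comparison step described above can be illustrated with a short Python sketch. This is not 4TCT's own code: the function names `flatten_threads` and `classify_threads` are hypothetical, and the sample payloads are invented, loosely modelled on a per-board thread list in which each page holds threads with a number (`no`) and a `last_modified` timestamp. No network requests are made; a real collector would fetch the listing from the API before each comparison.

```python
from typing import Dict, List, Set, Tuple


def flatten_threads(pages: List[dict]) -> Dict[int, int]:
    """Map thread number -> last_modified for a list of page dictionaries."""
    return {
        thread["no"]: thread["last_modified"]
        for page in pages
        for thread in page.get("threads", [])
    }


def classify_threads(
    previous: Dict[int, int], current: Dict[int, int]
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Compare two snapshots and return (new, active, dead) thread numbers."""
    new = set(current) - set(previous)    # seen for the first time
    dead = set(previous) - set(current)   # no longer listed on the board
    active = {                            # still listed and modified since last check
        no
        for no in set(current) & set(previous)
        if current[no] > previous[no]
    }
    return new, active, dead


# Invented sample snapshots (assumption: not real 4chan data).
previous_pages = [
    {"page": 1, "threads": [{"no": 100, "last_modified": 10},
                            {"no": 101, "last_modified": 12}]},
    {"page": 2, "threads": [{"no": 102, "last_modified": 5}]},
]
current_pages = [
    {"page": 1, "threads": [{"no": 103, "last_modified": 20},
                            {"no": 100, "last_modified": 18}]},
    {"page": 2, "threads": [{"no": 101, "last_modified": 12}]},
]

new, active, dead = classify_threads(
    flatten_threads(previous_pages), flatten_threads(current_pages)
)
print("new:", sorted(new), "active:", sorted(active), "dead:", sorted(dead))
```

In a monitoring loop, the current snapshot would become the previous one after each pass, so every cycle only has to download threads that are new or have changed.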

Data collected using 4TCT is stored in a directory created at runtime, and researchers are urged to comply with ethical standards when compiling databases using the tool.
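As a rough illustration of storing collected threads as JSON files in a directory created at runtime, the sketch below writes one file per thread under a per-board folder. The layout, the `save_thread` helper, and the sample payload are all assumptions made for this example and do not describe 4TCT's actual directory structure.

```python
import json
import tempfile
from pathlib import Path


def save_thread(base_dir: Path, board: str, thread_no: int, payload: dict) -> Path:
    """Write one thread's data to <base_dir>/<board>/<thread_no>.json."""
    board_dir = base_dir / board
    board_dir.mkdir(parents=True, exist_ok=True)  # create the directory on demand
    out_path = board_dir / f"{thread_no}.json"
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"posts": [{"no": 100, "com": "example text"}]}  # invented payload
    path = save_thread(Path(tmp), "board", 100, sample)
    print(path.name, json.loads(path.read_text(encoding="utf-8")) == sample)
```

Keeping one file per thread means a thread that is still active can simply be overwritten with its latest state on the next pass.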

Future development of 4TCT may include removing multithreading, refactoring dynamic elements, and improving code quality and robustness through testing suites and type hints. The tool could also be extended to gather image data, although this currently poses challenges in terms of copyright and legality vetting.

The release of 4TCT gives researchers a way to gather large amounts of text data from 4chan, supporting analysis of the platform's community and sociological research more broadly. Given its ease of use, it is hoped that the tool will help individual researchers and teams study 4chan and related topics.

Written by Jack H. Culbert
