Learning to Extract & Understand Twitter Data


I’m currently working on a research project to document and explore the Twitter resistance movement that has formed in the wake of the Trump administration. To save, visualize, and explore the vast number of tweets that have appeared under various accounts and hashtags during this time, I decided to turn to Twarc.

Developed by Ed Summers for the Documenting the Now project, Twarc is a command line tool and Python library for archiving Twitter JSON data. Using the Twitter API, users can collect tweets, hashtags, trends, followers, friends, retweets, replies… basically, anything publicly available on Twitter can be requested here. Using various commands, you can even set up collections of particular hashtags and accounts to track trending information. Twarc is one of four primary tools developed by Documenting the Now to work with Twitter data, each requiring a different level of technical proficiency. Twarc, like these other tools, reflects an effort to chronicle historically significant events and to consider ethical ways of working with social media content. Pitched from a mindset geared towards “archivists of the future,” Twarc offers a way to think about collecting and archiving Twitter data in forms that prioritize context, safety, and usability. And though other tools may exist for this kind of work, Twarc seems best prepared to handle long-term curation and expansive requests for Twitter data. In addition, the DocNow team behind it champions many of the questions around social media activism that can be placed in conversation with my aims for this project.
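To give a rough sense of what exploring an archive like this can look like, here is a minimal sketch in Python that counts hashtags across a set of collected tweets. It uses only the standard library and does not call Twarc itself. It assumes the tweets are stored as one JSON object per line and that each tweet carries its hashtags under `entities.hashtags`, as in Twitter's v1.1 tweet format; the `count_hashtags` helper and the two sample tweets are made up for illustration.

```python
import json
from collections import Counter


def count_hashtags(lines):
    """Count hashtags across an iterable of JSON-encoded tweets, one per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        # Assumption: v1.1-style layout, where tweet["entities"]["hashtags"]
        # is a list of objects that each have a "text" field.
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1  # fold case so #Resist == #resist
    return counts


# Two made-up tweets standing in for a real archive file.
sample = [
    '{"id_str": "1", "entities": {"hashtags": [{"text": "Resist"}]}}',
    '{"id_str": "2", "entities": {"hashtags": [{"text": "resist"}, {"text": "Archive"}]}}',
]

print(count_hashtags(sample).most_common(2))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list, for example a hypothetical `tweets.jsonl` produced by a collection run.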

Read the whole post here.
