Library of Congress Offers Update on Huge Twitter Archive Project
Back in April 2010, the Library of Congress agreed to archive four years’ worth of public tweets. Even by the standards of the nation’s most famous research library, the goal was an ambitious one. The librarians needed to build a sustainable system for receiving and preserving an enormous number of tweets, then organize that dataset by date. At the time, Twitter also agreed to provide future public tweets to the Library under the same terms, meaning any system would need the ability to scale up to epic size.

Now, the Library of Congress is reporting the project is completed. “We now have an archive of approximately 170 billion tweets and growing,” Gayle Osterberg, the Library’s director of communications, wrote in a Jan. 4 posting on the Library of Congress Blog. “The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012.”

The Library’s system is completely automated and relies on Gnip, which receives tweets in real time from Twitter and organizes the data into hour-long segments. Those segments are uploaded to a secure server throughout the course of the day. The Library then downloads each file to a temporary server space, checks it for completeness and errors, records statistics about the file and copies it to tape. The file is then deleted from the temporary server space. By October 2012, the Library was handling nearly 500 million tweets per day. The resulting archive is around 300 TB in size.

But there’s still a huge challenge: the Library needs to make that dataset accessible to researchers in a way they can actually use, an effort that’s apparently a priority going forward. Researchers have made some 400 inquiries about the material, with associated topics ranging from citizen journalism to predicting stock market activity.
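The download, check, record, archive and delete cycle described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the Library's actual software: the function name archive_segment, the SegmentStats record, the use of a SHA-256 checksum as the completeness check, and the plain directory standing in for tape are all assumptions made for the example.

```python
import hashlib
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SegmentStats:
    """Statistics recorded for one hourly segment (hypothetical fields)."""
    name: str
    size_bytes: int
    line_count: int
    sha256: str


def archive_segment(temp_file: Path, archive_dir: Path, expected_sha256: str) -> SegmentStats:
    """Check one downloaded segment, record stats, archive it, then clean up."""
    data = temp_file.read_bytes()

    # Completeness check. Assumption: a checksum is supplied with each file.
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {temp_file.name}")

    # Record statistics about the file (here: size and number of lines).
    stats = SegmentStats(
        name=temp_file.name,
        size_bytes=len(data),
        line_count=len(data.splitlines()),
        sha256=digest,
    )

    # Copy to long-term storage. A plain directory stands in for tape.
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(temp_file, archive_dir / temp_file.name)

    # Delete from the temporary space only after the copy has succeeded.
    temp_file.unlink()
    return stats
```

A scheduler would call archive_segment once for each hourly file as it arrives. Deleting the temporary copy last means a failed check or a failed copy leaves the original file in place for a retry.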
Right now, however, even a single query of the 2006-2010 archive takes as many as 24 hours to execute, which limits those researchers’ ability to do work in a timely way.

“Twitter is a new kind of collection for the Library of Congress but an important one to its mission,” Osterberg added. “As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.”

The Library of Congress faces the same issue as many companies trying to crunch tons of data into manageable form. On one hand, those organizations want to crunch as much data as possible, the better to gain more refined insight into business processes; on the other, the sheer size of those datasets pushes the limits of available software and hardware. The Library of Congress, like those companies, can only depend on advancing technology to speed the whole process.

Image: Galina Mikhalishina/