Lessons in Scaling A Large Digital Library: A Case Study For CiteSeerX

Open Access
Jordan, Douglas William
Area of Honors:
Computer Science
Bachelor of Science
Document Type:
Thesis Supervisors:
  • Clyde Lee Giles, Thesis Supervisor
  • John Joseph Hannan, Honors Advisor
Keywords:
  • scaling
  • cloud computing
  • digital libraries
  • architecture
  • systems
Abstract:
The current generation of CiteSeerX is heavily used. On an average day, the website receives over 2 million hits and 100,000 downloads. CiteSeerX has indexed over 7 million papers and is ingesting new papers at a rate of 5,000 per day. Although the service has scaled to this point, over the past year it has been approaching the limits of its current architecture: page load times have been increasing, and the ingestion of new papers has slowed. Both issues are believed to be largely due to the GFS file system and the MySQL database. In this thesis, we investigate re-architecting the back end of CiteSeerX.