Lessons in Scaling A Large Digital Library: A Case Study For CiteSeerX

Open Access
Author:
Jordan, Douglas William
Area of Honors:
Computer Science
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
  • Clyde Lee Giles, Thesis Supervisor
  • John Joseph Hannan, Honors Advisor
Keywords:
  • scaling
  • cloud computing
  • digital libraries
  • architecture
  • systems
Abstract:
The current generation of CiteSeerX is widely used. On an average day, the website receives over 2 million hits and 100,000 downloads. CiteSeerX has indexed over 7 million papers and is ingesting new papers at a rate of 5,000 per day. While the service has been able to scale to this point, over the past year it has been approaching the limits of its current architecture: page load times have been increasing, and the ingestion of new papers has slowed. Both issues are believed to be largely due to the GFS file system and the MySQL database. In this thesis, we investigate re-architecting the back end of CiteSeerX.