The Migration of Data and Refactoring of Large Scale Digital Libraries: A Case Study For CiteSeerX
Open Access
Author:
Parsons, Sean Walter
Area of Honors:
Information Sciences and Technology
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
Clyde Lee Giles, Thesis Supervisor
Edward J Glantz, Thesis Honors Advisor
Keywords:
search engine, digital libraries, data migration, SQL, NoSQL, Elasticsearch
Abstract:
CiteSeerX is one of the first academic digital libraries in the world and currently contains data on over 10 million academic documents. While the current technical architecture of CiteSeerX has scaled well to this point, there is a need to ingest more papers and to use modern tools to increase efficiency. This thesis examines NoSQL datastores and new ways to represent relational data in non-relational databases. Additionally, we compare the performance of Elasticsearch and MongoDB on our dataset and propose a new indexing system for CiteSeerX.