Design and Implementation of an Information Retrieval System for Academic Publications
Open Access
Author:
Chhay, Jason
Area of Honors:
Computer Science
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
C Lee Giles, Thesis Supervisor; Ting He, Thesis Honors Advisor
Keywords:
CiteSeerX, Search Engine, COVID-19, Information Retrieval, Web application, Web development, Elasticsearch, Search Interface
Abstract:
This thesis outlines the design and implementation of a web-based information retrieval system for academic journals and publications. In particular, it focuses on improving and sustaining the existing infrastructure of CiteSeerX, hosted by the Penn State College of Information Sciences and Technology. The current implementation of CiteSeerX is analyzed across each stage of its pipeline: document crawling, information extraction and ingestion, document indexing, and the web-based search interface. A selection of potential new features is implemented and prototyped in COVIDSeer, a small-scale search interface built on the CORD-19 dataset to assist the global COVID-19 research effort. These features are then carried over into a prototype for a future iteration of CiteSeerX that uses modern programming languages and frameworks for more efficient querying and a more maintainable codebase. This thesis also highlights the design and implementation challenges of COVIDSeer and the new system, to assist future work on similar search engines for academic publications.