The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Description
In the digital humanities, there is a constant need to turn images and PDF files into plain text so that techniques such as topic modelling and named entity recognition can be applied. Although several solutions exist for extracting text embedded in PDF files or running OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented in Java using the Spring Framework and are available under an open source license on GitHub (https://github.com/diging/).

Details

Contributors
Date Created
2017-09-28
Resource Type
Language
  • eng
Note
  • The final version of this article, as published in the Journal of Open Research Software, can be viewed online at: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.164/
Citation and reuse

Cite this item

This is a suggested citation. Consult the appropriate style guide for specific citation guidelines.

Damerow, J., Peirson, B. R., & Laubichler, M. D. (2017). The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software, 5. doi:10.5334/jors.164

Additional Information
English
Series
  • JOURNAL OF OPEN RESEARCH SOFTWARE
Extent
  • 5 pages
Open Access
Peer-reviewed
Identifier
  • Digital object identifier: 10.5334/jors.164
  • International standard serial number: 2049-9647