CorDeep is a machine-learning-based web application that extracts visual elements from historical sources and classifies pages that contain numerical and alphanumerical tables. It locates visual elements and classifies them into the following categories: “Content Illustrations,” “Initials,” “Decorations,” and “Printer's Marks.” CorDeep is trained on the Sphaera corpus, a collection of 359 early modern treatises comprising about 78,000 pages, 30,000 visual elements, and 10,000 pages containing tables. The collection consists of early modern textbooks on geocentric cosmology (https://sphaera.mpiwg-berlin.mpg.de). The visual elements were manually annotated with bounding boxes and semantic labels, whereas the pages with tables were identified semiautomatically by an incrementally improved model supervised by a human expert. CorDeep reaches an average precision of up to 98% for the detection of visual elements and an accuracy of 94% for the classification of pages containing tables. These values may vary depending on the style, content, and quality of the input images.
CorDeep accepts PDFs, images in all common file formats (e.g., JPEG, PNG, TIFF, WEBP), and IIIF manifest JSON provided as a file or URL. CorDeep provides a wide range of export formats for downloading the extracted visual elements.
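As an illustration of the IIIF input option mentioned above, the following Python sketch shows what kind of information an IIIF manifest carries: a list of page images that a tool can fetch. It is not CorDeep's own code, and how CorDeep parses manifests internally is not documented here. The function name iiif_image_urls and the example.org URLs are hypothetical; the key layout follows the public IIIF Presentation API (2.x and 3.0).

```python
def iiif_image_urls(manifest: dict) -> list[str]:
    """Collect page image URLs from an IIIF Presentation manifest (2.x or 3.0)."""
    urls = []

    # Presentation API 2.x: sequences -> canvases -> images -> resource["@id"]
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])

    # Presentation API 3.0: items (canvases) -> items (annotation pages)
    # -> items (annotations) -> body["id"]
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", {})
                if isinstance(body, dict) and "id" in body:
                    urls.append(body["id"])

    return urls


# Hypothetical two-page manifest in the 2.x layout (URLs are placeholders).
example_manifest = {
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {"images": [{"resource": {"@id": "https://example.org/iiif/page1/full/full/0/default.jpg"}}]},
                {"images": [{"resource": {"@id": "https://example.org/iiif/page2/full/full/0/default.jpg"}}]},
            ]
        }
    ],
}

page_urls = iiif_image_urls(example_manifest)
```

A manifest stored as a file can be read with json.load before being passed to the function; each returned URL then points to one page image.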
Users are responsible for the validity and use of the extracted data. Users assume the risk for any losses or damages arising from the use of this service.
Any use of data provided by CorDeep has to be credited by citing the following publication: Büttner, J.; Martinetz, J.; El-Hajj, H.; Valleriani, M. CorDeep and the Sacrobosco Dataset: Detection of Visual Elements in Historical Documents. J. Imaging 2022, 8, 285. https://doi.org/10.3390/jimaging8100285
The Sacrobosco Visual Elements Dataset (S-VED) that was used to train the visual element extraction model is available on Zenodo.
CorDeep is funded by the "Berlin Institute for the Foundations of Learning and Data" (https://bifold.berlin) (Ref. 01IS18025A and Ref. 01IS18037A) and was developed in the framework of the research project "The Sphere. Knowledge System Evolution and the Shared Scientific Identity of Europe" (https://sphaera.mpiwg-berlin.mpg.de), based at the Max Planck Institute for the History of Science, Berlin.