CorDeep

What is CorDeep?

CorDeep is a machine-learning-based web application that extracts visual elements from historical sources and classifies pages that contain numerical and alphanumerical tables. It locates visual elements and classifies them into the following categories: “Content Illustrations,” “Initials,” “Decorations,” and “Printer's Marks.” CorDeep is trained on the Sphaera corpus, a collection of 359 early modern treatises containing about 78,000 pages, 30,000 visual elements, and 10,000 pages with tables. The collection consists of early modern textbooks on geocentric cosmology (https://sphaera.mpiwg-berlin.mpg.de). The visual elements were manually annotated with bounding boxes and semantic labels, whereas the pages with tables were identified semiautomatically by an incrementally improved model supervised by a human expert. CorDeep reaches an average precision of up to 98% for the detection of visual elements and an accuracy of 94% for the classification of pages containing tables. These values may change depending on the style, content, and quality of the input images.

How does it work?

CorDeep accepts PDFs, images in all common file formats (e.g., JPEG, PNG, TIFF, WEBP), and IIIF manifest JSON provided as a file or URL. CorDeep provides a wide range of export formats for downloading the extracted visual elements:

Raster data (original pages, original pages with visual elements labeled with bounding box and semantic label, and cropped visual elements).
CSV file with bounding box and label information.
W3C Web Annotation Collection.
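
To illustrate the last export format, the following minimal sketch builds a W3C Web Annotation Collection for one detected visual element. It follows the W3C Web Annotation Data Model, but the exact structure of the file that CorDeep exports is not documented here, so the field choices, the function names `box_to_annotation` and `make_collection`, and the example URLs are assumptions for illustration only.

```python
import json


def box_to_annotation(image_url, label, x, y, w, h):
    """Build one annotation dict for a bounding box given in XYWH pixels.

    Hypothetical helper: the layout follows the W3C Web Annotation Data
    Model, not necessarily the exact output of CorDeep.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        # The semantic label of the visual element, e.g. "Initials".
        "body": {"type": "TextualBody", "value": label, "purpose": "tagging"},
        "target": {
            "source": image_url,
            # Media-fragment selector: region of the image as x,y,w,h.
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={round(x)},{round(y)},{round(w)},{round(h)}",
            },
        },
    }


def make_collection(collection_id, annotations):
    """Wrap annotations in an AnnotationCollection with a single page."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": collection_id,  # assumed identifier, chosen by the caller
        "type": "AnnotationCollection",
        "total": len(annotations),
        "first": {"type": "AnnotationPage", "items": annotations},
    }


if __name__ == "__main__":
    # Example values are invented for illustration.
    ann = box_to_annotation("https://example.org/page1.jpg", "Initials", 10, 20, 100, 50)
    collection = make_collection("https://example.org/collection/1", [ann])
    print(json.dumps(collection, indent=2))
```

A file in this shape can be read by any tool that understands the Web Annotation Data Model; the `xywh` fragment carries the same information as the XYWH coordinate format.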

CorDeep offers all common bounding box coordinate formats, with either relative or absolute coordinates. Note that the absolute coordinates do not correspond to the original image size: image data is resized before upload, so absolute coordinates refer to the resized images, which you can download. The available formats are:

XYXY.
1st X: x-coordinate of the left border of the bounding box.
2nd X: x-coordinate of the right border of the bounding box.
1st Y: y-coordinate of the top border of the bounding box.
2nd Y: y-coordinate of the bottom border of the bounding box.

XYWH.
X: x-coordinate of the left border of the bounding box.
Y: y-coordinate of the top border of the bounding box.
W: width of the bounding box.
H: height of the bounding box.

XCYCWH.
XC: x-coordinate of the center of the bounding box.
YC: y-coordinate of the center of the bounding box.
W: width of the bounding box.
H: height of the bounding box.
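
The three formats listed above carry the same information and can be converted into one another. The sketch below shows the arithmetic, and also how relative coordinates can be mapped onto an image of any size, for example the original, non-resized scan. It assumes that boxes are plain 4-tuples in the order the format name suggests (XYXY as left, top, right, bottom) and that relative coordinates are fractions of the image width and height. The function names are illustrative and not part of CorDeep.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xyxy_to_xywh(box: Box) -> Box:
    """(left, top, right, bottom) -> (left, top, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def xywh_to_xcycwh(box: Box) -> Box:
    """(left, top, width, height) -> (center x, center y, width, height)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2, w, h)


def xcycwh_to_xyxy(box: Box) -> Box:
    """(center x, center y, width, height) -> (left, top, right, bottom)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def absolute_to_relative(box: Box, width: float, height: float) -> Box:
    """Divide x-like values by the image width and y-like values by the height.

    Works for all three formats, because positions 0 and 2 are always
    horizontal quantities and positions 1 and 3 are always vertical ones.
    """
    a, b, c, d = box
    return (a / width, b / height, c / width, d / height)


def relative_to_absolute(box: Box, width: float, height: float) -> Box:
    """Inverse of absolute_to_relative for an image of the given size."""
    a, b, c, d = box
    return (a * width, b * height, c * width, d * height)


if __name__ == "__main__":
    # Invented example: a box on a resized page of 200 x 100 pixels.
    box_xyxy = (10, 20, 110, 70)
    print(xyxy_to_xywh(box_xyxy))
    print(xywh_to_xcycwh(xyxy_to_xywh(box_xyxy)))

    # Map the box onto a hypothetical original scan of 2000 x 1000 pixels.
    rel = absolute_to_relative(box_xyxy, 200, 100)
    print(relative_to_absolute(rel, 2000, 1000))
```

Because absolute coordinates refer to the resized images, going through relative coordinates in this way is one option for locating a visual element on the original scan, provided the original pixel dimensions are known.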

CorDeep allows you to download a CSV file of the pages that contain numerical and alphanumerical tables.
CorDeep does not permanently store uploaded material and extracted data.
CorDeep does not collect any user data.

Try the extraction tool here:
Extract Visual Elements

Disclaimer

Users are responsible for the validity and use of the extracted data.
Users assume the risk for any losses or damages arising from the use of this service.

Credits

Any use of data provided by CorDeep must be credited by citing the following publication:
Büttner, J.; Martinetz, J.; El-Hajj, H.; Valleriani, M. CorDeep and the Sacrobosco Dataset: Detection of Visual Elements in Historical Documents. J. Imaging 2022, 8, 285. https://doi.org/10.3390/jimaging8100285

Data

The Sacrobosco Visual Elements Dataset (S-VED) that was used to train the visual element extraction model is available on Zenodo.

Support

CorDeep is funded by the "Berlin Institute for the Foundations of Learning and Data" (https://bifold.berlin) (Ref. 01IS18025A and Ref. 01IS18037A) and was developed in the framework of the research project "The Sphere. Knowledge System Evolution and the Shared Scientific Identity of Europe" (https://sphaera.mpiwg-berlin.mpg.de), based at the Max Planck Institute for the History of Science, Berlin.