About

ChemSchematicResolver is an open-source Python package for automatically extracting chemical structures from schematic diagrams.

Check out the documentation to get started, or look below to see how it works.

Image Mining

Figure captions within scientific documents are mined with ChemDataExtractor to identify Figures that might contain chemical schematic diagrams.

Figure Scraping

ChemSchematicResolver uses ChemDataExtractor to automatically detect relevant images from within Figures of a given HTML or XML document. The images for these Figures are downloaded locally for extraction.

Feature Detection

Chemical schematic diagrams and text labels are identified, classified and assigned to each other.

Segmentation

First, the image is sub-divided into regions of interest.

Small gaps in the image are blurred over so that each region captures all nearby features.

The connected paths of these blurred pixels are then used to locate the regions of interest.
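The blur-then-connect idea can be sketched in a few lines of plain Python. This is a minimal illustration, not ChemSchematicResolver's own code: it assumes a binary image stored as nested lists, uses a square dilation as a stand-in for the blurring step, and the names dilate, regions_of_interest and radius are hypothetical.

```python
from collections import deque


def dilate(grid, radius=1):
    """Spread each foreground pixel over its square neighbourhood.

    A crude stand-in for blurring (assumption): small gaps between
    nearby features are filled so they merge into one region.
    """
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def regions_of_interest(grid, radius=1):
    """Return (top, left, bottom, right) boxes of connected blurred pixels."""
    blurred = dilate(grid, radius)
    h, w = len(blurred), len(blurred[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not blurred[y][x] or seen[y][x]:
                continue
            # Breadth-first search over 4-connected neighbours.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and blurred[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Two pixels one gap apart (top left) and one isolated pixel (bottom right).
image = [
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
]
boxes = regions_of_interest(image, radius=1)
```

With radius=1 the two neighbouring pixels merge into a single region, whereas radius=0 (no blurring) would leave three separate regions. The returned boxes describe the blurred regions, so they extend slightly beyond the original pixels.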

Classification

The regions of interest are classified as labels or diagrams using a k-means clustering algorithm.
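As a rough sketch of how a two-cluster k-means could separate labels from diagrams, the example below clusters regions on a single feature. The choice of feature (pixel area), the seeding strategy, and the rule that the larger cluster holds the diagrams are all assumptions made for illustration; the features actually used by ChemSchematicResolver are not stated here.

```python
def kmeans_two_clusters(values, iterations=20):
    """Plain 1-D k-means with k=2, seeded with the smallest and largest value."""
    centres = [min(values), max(values)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        assignment = [
            0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            for v in values
        ]
        # Update step: move each centre to the mean of its members.
        for k in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centres[k] = sum(members) / len(members)
    return assignment


# Hypothetical region areas in pixels: small regions are assumed to be labels.
areas = [120, 95, 4800, 5200, 110, 4500]
kinds = ["diagram" if a == 1 else "label" for a in kmeans_two_clusters(areas)]
```

Because cluster 1 is seeded with the largest value, it ends up holding the larger regions, which this sketch treats as diagrams.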

Match Chemicals and Labels

Diagrams and labels are paired up via a proximity-driven algorithm.

The pairs are shown in the same colour.
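One simple way to pair regions by proximity is a greedy nearest-centre match, sketched below. This is only an illustration under stated assumptions: regions are given as (top, left, bottom, right) boxes, distance is measured between box centres, and the function names are hypothetical. The actual matching algorithm may weigh proximity differently.

```python
from math import hypot


def centre(box):
    """Centre point (row, column) of a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return ((top + bottom) / 2, (left + right) / 2)


def pair_by_proximity(diagrams, labels):
    """Greedily pair diagrams and labels, closest centres first."""
    candidates = sorted(
        (hypot(centre(d)[0] - centre(l)[0], centre(d)[1] - centre(l)[1]), i, j)
        for i, d in enumerate(diagrams)
        for j, l in enumerate(labels)
    )
    used_diagrams, used_labels, pairs = set(), set(), []
    for _, i, j in candidates:
        # Each diagram and each label is used at most once.
        if i not in used_diagrams and j not in used_labels:
            pairs.append((i, j))
            used_diagrams.add(i)
            used_labels.add(j)
    return sorted(pairs)


# Two diagrams side by side, each with a label directly beneath it.
diagrams = [(0, 0, 40, 40), (0, 100, 40, 140)]
labels = [(45, 110, 55, 130), (45, 10, 55, 30)]
pairs = pair_by_proximity(diagrams, labels)  # (diagram index, label index)
```

Here the first diagram is matched with the second label and vice versa, since those are the labels sitting directly below each diagram.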

Resolve Structures

Next, diagrams and labels are resolved into a machine-readable format.

Reading Labels

Optical Character Recognition (OCR) is applied to each label to convert it into a text string, which is then tokenized.
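The tokenization step can be illustrated on a string that is assumed to have already come out of OCR. The pattern below is a hypothetical, deliberately simple tokenizer written for this example; it is not the tokenizer used by ChemSchematicResolver, and the OCR step itself is omitted.

```python
import re

# Hypothetical token pattern: alphanumeric runs, or a single punctuation mark
# that commonly appears in compound labels.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[=,;:()]")


def tokenize(label_text):
    """Split an OCR'd label string into a flat list of tokens."""
    return TOKEN_PATTERN.findall(label_text)


# Example label text, assumed to be the output of the OCR step.
tokens = tokenize("1a: R = Me, X = Cl")
```

The result is a list of compound identifiers, variables, values and separators that later stages can inspect one token at a time.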

Identify R-Group

R-group structures use variables (such as R) in place of certain atoms in the chemical diagram. The values that R can take are described in the label.

Resolved labels are scanned for features that indicate R-group structures, and the relevant values are extracted by ChemSchematicResolver.
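As an illustration of what scanning a label for R-group features might look like, the sketch below treats a "variable = value" pattern as the indicating feature. The regular expression and the function name extract_r_groups are assumptions for this example only; the real detection rules may be richer.

```python
import re

# Assumed feature: an upper-case variable (optionally numbered or primed),
# an equals sign, then an alphanumeric value such as "Me" or "OMe".
ASSIGNMENT = re.compile(r"\b([A-Z][0-9']*)\s*=\s*([A-Za-z0-9]+)")


def extract_r_groups(label_text):
    """Map each R-group variable found in the label to its value."""
    return {variable: value for variable, value in ASSIGNMENT.findall(label_text)}


simple = extract_r_groups("1a: R = Me, X = Cl")
numbered = extract_r_groups("2b: R1 = OMe, R2 = H")
```

A label with no "variable = value" pattern yields an empty dictionary, which a caller could take to mean that the diagram has no R-groups to substitute.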

Resolving Chemical Structures

Chemical diagrams are resolved using pyosra, a Python tool that extends the capabilities of OSRA to allow for R-group structures.

The software returns a simplified molecular-input line-entry system (SMILES) string, which describes the chemical structure.