ChemSchematicResolver is an open-source Python package for automatically extracting chemical structures from schematic diagrams.
Check out the documentation to get started, or look below to see how it works.
Figure captions within scientific documents are mined with ChemDataExtractor to identify Figures that might contain chemical schematic diagrams.
ChemSchematicResolver uses ChemDataExtractor to automatically detect relevant images within the Figures of a given HTML or XML document. The images for these Figures are then downloaded locally for extraction.
Chemical schematic diagrams and text labels are identified, classified and assigned to each other.
First, the image is sub-divided into regions of interest.
Small gaps in the image are blurred over to ensure that each region contains all features in the near vicinity.
The connected paths of these blurred pixels are then used to locate the regions of interest.
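The segmentation idea above can be sketched on a tiny binary image. This is only an illustration under stated assumptions: a square binary dilation stands in for the blurring step, regions are found with a 4-connected flood fill, and the names dilate and find_regions are hypothetical rather than part of ChemSchematicResolver.

```python
from collections import deque


def dilate(grid, radius=1):
    """Spread each foreground pixel over a square window (a crude stand-in for blurring)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c]:
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def find_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected foreground regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                # Breadth-first flood fill from this unvisited foreground pixel.
                queue = deque([(r, c)])
                seen[r][c] = True
                top, left, bottom, right = r, c, r, c
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes


# Four isolated "atoms" on the left and a two-pixel "label" on the right.
image = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 1, 1],
]
raw_regions = find_regions(image)                        # fragments stay separate
blurred_regions = find_regions(dilate(image, radius=1))  # nearby fragments merge
```

Without the dilation the left-hand fragments are reported as separate regions. After it they merge into a single region of interest, while the more distant right-hand group remains its own region. The boxes describe the dilated regions, so they are slightly larger than the original features.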
The regions of interest are classified as labels or diagrams using a k-means clustering algorithm.
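As a rough sketch of the classification step, the example below runs a two-cluster k-means (Lloyd's algorithm) over a single feature. The choice of region area as the feature, the sample values, and the function name kmeans_1d are assumptions made for illustration. The source states only that k-means clustering is used, not which features are clustered.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain Lloyd's algorithm on scalar values (k >= 2); returns (centroids, assignments)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the starting centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, a in zip(values, assignments) if a == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical bounding-box areas (in pixels) for six regions of interest.
areas = [120, 95, 4100, 3800, 150, 4500]
centroids, assignments = kmeans_1d(areas)

# Assumption: the cluster with the larger mean area holds the diagrams.
large = max(range(len(centroids)), key=lambda j: centroids[j])
kinds = ["diagram" if a == large else "label" for a in assignments]
```

Here the small regions fall into one cluster and the large ones into the other. A real classifier would likely need more than one feature to separate labels from small diagrams.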
Diagrams and labels are paired up via a proximity-driven algorithm.
Each diagram and its assigned label are shown in the same colour.
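One simple way to picture proximity-driven pairing is a greedy match on the distance between bounding-box centres. The sketch below is an assumption about how such a pairing could work, not the algorithm ChemSchematicResolver actually implements. The helper names and the box coordinates are hypothetical.

```python
from math import hypot


def centre(box):
    """Centre (y, x) of a bounding box given as (top, left, bottom, right)."""
    top, left, bottom, right = box
    return ((top + bottom) / 2, (left + right) / 2)


def pair_by_proximity(diagrams, labels):
    """Greedily pair each diagram with the closest label that is still unassigned."""
    candidates = sorted(
        (hypot(centre(d)[0] - centre(l)[0], centre(d)[1] - centre(l)[1]), di, li)
        for di, d in enumerate(diagrams)
        for li, l in enumerate(labels)
    )
    pairs, used_diagrams, used_labels = {}, set(), set()
    for _, di, li in candidates:  # shortest distances first
        if di not in used_diagrams and li not in used_labels:
            pairs[di] = li
            used_diagrams.add(di)
            used_labels.add(li)
    return pairs


# Two diagrams side by side, each with a label directly underneath.
diagrams = [(0, 0, 100, 100), (0, 200, 100, 300)]
labels = [(110, 230, 130, 270), (110, 30, 130, 70)]
pairs = pair_by_proximity(diagrams, labels)  # maps diagram index to label index
```

Each diagram ends up with the label sitting beneath it, regardless of the order in which the labels were detected.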
Next, diagrams and labels are resolved into a machine-readable format.
Optical Character Recognition (OCR) is applied to each label to convert it into a text string, which is then tokenized.
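The tokenization step can be illustrated with a small regular expression. The OCR output string below is invented for the example, and the tokenizer is a minimal sketch rather than the one used by ChemSchematicResolver or ChemDataExtractor.

```python
import re

# Runs of letters and digits form one token; any other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split an OCR'd label string into word-like tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


ocr_text = "1a: R = Me, X = H"  # hypothetical OCR result for one label
tokens = tokenize(ocr_text)
```

Keeping punctuation such as ":" and "=" as separate tokens makes it easier for later steps to spot the structure of a label.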
R-group structures use variables (such as R) in place of certain atoms in the chemical diagram. The values that R can take are described in the label.
Resolved labels are scanned for features that indicate R-group structures, and the relevant values are extracted by ChemSchematicResolver.
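To make the R-group step concrete, the sketch below pulls variable assignments such as "R = Me" out of a label string. The label format, the regular expression, and the function name extract_r_groups are assumptions for illustration. Real labels vary far more than this pattern allows.

```python
import re

# A variable (e.g. R, R1, X, Ar) followed by "=" and a simple value (e.g. Me, OMe, Ph).
ASSIGNMENT = re.compile(r"\b([A-Z][a-z]?\d*)\s*=\s*([A-Za-z0-9()\-]+)")


def extract_r_groups(label_text):
    """Return (compound label, {variable: value}) from text like '1a: R = Me, X = H'."""
    head, _, tail = label_text.partition(":")
    if not tail:
        # No colon found: treat the whole string as the assignment list.
        head, tail = "", label_text
    return head.strip(), dict(ASSIGNMENT.findall(tail))


compound, values = extract_r_groups("1a: R = Me, X = H")
compound_2, values_2 = extract_r_groups("2: R1 = OMe, R2 = Ph")
```

Each extracted value could then be substituted for its variable in the paired diagram to give a fully specified structure.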