4 min read

Image recognition (OCR) with Tesseract by Google

Recently I have read on R weekly post about using image recognition engine by Google tesseract package. This package provides tool for optical character recognition (OCR) - simply allows to retrieve text from images. Tesseract is used for text detection on mobile devices, in video, and in Gmail image spam detection. How cool is that? Let´s try it too.

library(tidyverse)
library(magick)
library(tesseract)

Sample image

We need some testing image. I´m curious especially how well it perform on product labels like serial numbers, etc. (not the barcode, just the text). No need to think a lot - let´s use first one by google search “product label serial number”.

I have no idea what product is this label related to, but it will serve our purpose.

We are interested in P/N and S/N so the expected outcome will be:

  • P/N: 8-01532-01
  • S/N: QTMCHOU-12345AB678

First shot - no image preprocessing

path <- "../../static/data/label.png"

Just read the image and pass it to ocr function.

path %>%
  image_read() %>% 
  ocr()
## [1] "HO TT AMY ASSemmbbtedt im NINN I\nPIN: 8-01532-01 United States REV:A\nService Call nT Serial Number\nNIT TT\nSystem S/N: QTMCHOU-12345AB678\n"

We can see some successfully recognized strings delimited with \n. Let´s tweak it a bit into tibble.

path %>%
  image_read() %>% 
  ocr() %>% 
  str_split("\n") %>% 
  unlist %>% 
  as_tibble()
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
## # A tibble: 6 x 1
##   value                              
##   <chr>                              
## 1 HO TT AMY ASSemmbbtedt im NINN I   
## 2 PIN: 8-01532-01 United States REV:A
## 3 Service Call nT Serial Number      
## 4 NIT TT                             
## 5 System S/N: QTMCHOU-12345AB678     
## 6 ""

Now we have tibble line by line as recognized from image. Obviously the barcode is mistakenly recognized as text. Besides notice that “P/N:” is recognized as “PIN:” - not exactly what we want.

Tune just a bit - including image preprocessing

Package magick comes into play now with resizing, converting image into grayscale (not needed in this case), contrasting and enhancing the image in order to help the model more easily recognize the letters.

df <- path %>%
  image_read() %>% 
  image_resize("2000x") %>% 
  image_convert(type="grayscale") %>%
  image_contrast() %>%
  image_enhance() %>%
  ocr() %>% 
  str_split("\n") %>% 
  unlist %>% 
  as_tibble()
df
## # A tibble: 6 x 1
##   value                                    
##   <chr>                                    
## 1 WOM AO VOM LS LIE ASSermattet ire MMII AN
## 2 P/N: 8-01532-01 United States REV:A      
## 3 Service Call System Serial Number        
## 4 lil i a HMMM MON OL A                    
## 5 System S/N: QTMCHOU-12345AB678           
## 6 ""

It is close to what we need. We can see correctly recognized “P/N:”. As a next step my first naive approach was filtering “P/N” or “S/N” and then “somehow” get rid off “System” word. However it would be quite hardcoded (but easy). So my instinct tells me - regular expressions:). Here I remembered the famous quote:)

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.  - Jamie Zawinski

It turned out be the most difficult part to solve.

df %>% 
  mutate(regex = str_extract(value, "[^a-z]/N: [\\S]*"))
## # A tibble: 6 x 2
##   value                                     regex                  
##   <chr>                                     <chr>                  
## 1 WOM AO VOM LS LIE ASSermattet ire MMII AN <NA>                   
## 2 P/N: 8-01532-01 United States REV:A       P/N: 8-01532-01        
## 3 Service Call System Serial Number         <NA>                   
## 4 lil i a HMMM MON OL A                     <NA>                   
## 5 System S/N: QTMCHOU-12345AB678            S/N: QTMCHOU-12345AB678
## 6 ""                                        <NA>

Simple but powerfull. Used regular expression can be described basically as find “/N:” and take any letter before ([^a-z]) and any non-white space after ( [\\S]*).

Voilà

Now we can finally tune the output to get exactly expected outcome:

df %>% 
  mutate(regex = str_extract(value, "[^a-z]/N: [\\S]*")) %>% 
  filter(!is.na(regex)) %>% 
  separate(regex, " ", into = c("key", "value"))
## # A tibble: 2 x 2
##   key   value             
##   <chr> <chr>             
## 1 P/N:  8-01532-01        
## 2 S/N:  QTMCHOU-12345AB678

BTW

There is even a function ocr_data providing confidence how well each word was recognized.

path %>%
  image_read() %>% 
  image_resize("2000x") %>% 
  image_convert(type="grayscale") %>%
  image_contrast() %>%
  image_enhance() %>%
  ocr_data() %>% 
  arrange(desc(confidence)) %>% 
  print(n=10)
## # A tibble: 29 x 3
##    word               confidence bbox             
##    <chr>                   <dbl> <chr>            
##  1 System                   96.6 542,236,791,294  
##  2 Service                  96.5 112,236,362,294  
##  3 Serial                   96.3 820,236,1008,294 
##  4 System                   96.1 116,402,360,473  
##  5 Call                     95.6 389,236,514,294  
##  6 Number                   95.4 1039,236,1305,294
##  7 P/N:                     93.7 198,122,324,215  
##  8 8-01532-01               92.2 642,126,983,178  
##  9 S/N:                     90.3 386,402,524,458  
## 10 QTMCHOU-12345AB678       89.4 616,401,1454,459 
## # ... with 19 more rows