So, how do we convert a physical book to its equivalent text document?

The document

In 2003, my distant relative, Stanley E. Perry, published his autobiography titled, “A Yank and the Australian 9th: Surviving Washing Machine Charlie.” To my knowledge, very few copies of the book were ever published; the copy that fell into my hands had been passed around my family for years, and had become quite…well read.

The pictures painted in Stanley’s words are nothing short of thrilling. Filled with happiness, sadness, fear, and everything in between, the account takes us on a personal journey through the confusion and hell of war and back. The author left contact information on the copyright page, but all attempts to reach him and his extended family have been met with silence. This incredible first-hand account deserves to be preserved, and given the condition of the only copy I’d ever come across, I felt the need to digitize it. This is the story of that digitization process.

Step 1: Capture

To avoid stressing the book’s spine, I began digitization by capturing a high-resolution photograph of each page rather than pressing the book flat. This resulted in a series of images, the first three of which are shown below. Although each image used in this project was clear, the images below have been artificially blurred to make the text unreadable for copyright reasons.

From an image processing point of view, the problem immediately becomes difficult. The varying camera angle, the geometry of the book, and the fact that the pages are never pressed flat (as they would be on a photocopier) create shadows and changing lighting conditions, and skew the text so that the words on each page are slanted. Furthermore, focus falloff and optical aberrations left words in the middle of each page sharp while words near its edges were blurred.

Step 2: Conversion

Artificial intelligence and machine learning have given mankind incredible tools that were out of reach just a few years ago. Among them is Optical Character Recognition (OCR), which uses image processing and machine learning techniques to analyze the pixels of an image and identify the patterns that form letters, numbers, and symbols. Each pattern is then compared to a database of known characters, and the best match is assigned to that set of pixels.
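To make the matching idea concrete, here is a toy sketch — not any production OCR engine — that compares a tiny binary bitmap against a dictionary of known glyphs and picks the character whose template agrees on the most pixels. The 3×3 glyphs and the classify function are purely illustrative:

```python
# Toy template matching: each "glyph" is a small binary bitmap, and an
# unknown patch is assigned the character whose template agrees with it
# on the most pixels.
KNOWN_GLYPHS = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}

def classify(patch):
    """Return the known character whose bitmap best matches `patch`."""
    def score(template):
        # Count positions where the template pixel equals the patch pixel.
        return sum(
            t == p
            for trow, prow in zip(template, patch)
            for t, p in zip(trow, prow)
        )
    return max(KNOWN_GLYPHS, key=lambda ch: score(KNOWN_GLYPHS[ch]))

# A noisy "I" (one flipped pixel) is still closest to the I template.
print(classify(["010", "011", "010"]))  # prints I
```

Real OCR engines work on far richer features than raw pixels, but the compare-against-known-characters step is the same in spirit.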

Given the number of pages in the autobiography (~580), the process cries out for automation. The Python ecosystem offers a number of OCR packages, including EasyOCR, Doctr, Keras-OCR, Tesseract, GOCR, Pytesseract, OpenCV, and AWS. But there’s another, rarely discussed alternative: Google Drive. Images uploaded to Google Drive can be converted into Google Docs containing the recognized text, effectively providing the OCR we’re looking for while leveraging Google’s efficiency, accuracy, and server time. All that’s left for us to do is automate the routine!

Connecting to Google Drive

The first script checks the user’s connection to the Google Drive API; note that some setup needs to be performed on the Google side (enabling the Drive API and creating OAuth client credentials) for this code to work correctly! When run for the first time, the script has the user log in and generates a token.json file to store the credentials. On later runs, the stored credentials are reused, refreshed, or regenerated as needed.

import os

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# If modifying these scopes, delete the file token.json.
SCOPES = ["https://www.googleapis.com/auth/drive.metadata.readonly"]

def main():
  """
  Shows basic usage of the Drive v3 API.
  Prints the names and ids of the first 10 files the user has access to.
  """
  creds = None
  # token.json stores the user's access and refresh tokens, and is created
  # automatically when the authorization flow completes for the first time.
  if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)

  # If there are no (valid) credentials available, let the user log in.
  if not creds or not creds.valid:
    # If credentials are expired or need to be refreshed,
    if creds and creds.expired and creds.refresh_token:
      creds.refresh(Request())
    # Otherwise, generate client credentials
    else:
      flow = InstalledAppFlow.from_client_secrets_file(
        "client_secret.json", SCOPES
      )
      creds = flow.run_local_server(port=0)

    # Save the credentials for the next run
    with open("token.json", "w") as token:
      token.write(creds.to_json())

  try:
    service = build("drive", "v3", credentials=creds)

    # Call the Drive v3 API
    results = (
        service.files()
        .list(pageSize=10, fields="nextPageToken, files(id, name)")
        .execute()
    )
    items = results.get("files", [])
    if not items:
      print("No files found.")
      return
    print("Files:")
    for item in items:
      print(f"{item['name']} ({item['id']})")
  except HttpError as error:
    print(f"An error occurred: {error}")

if __name__ == "__main__":
  main()

With valid credentials, the Google Drive API is called and the connection is established.

Iteratively process each image

The second script takes advantage of the credentials to carry out the rest of the process. The main function gives a quick summary of the user > Google Drive > conversion > download > user pipeline: the user’s Google Drive credentials are obtained, used to authorize an HTTP client, and the resulting service object is used to upload and download the necessary files.

import httplib2
from glob import glob

from googleapiclient import discovery

def main():
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)

    # Use glob to find all image files in the images directory
    filelist = glob('./images/*.jpg')

    # Iteratively convert each image file using Google Docs
    for file in sorted(filelist):
        convert_to_text(service, imgfile=file)

The get_credentials function reads the stored access token from token.json and, if it is missing or invalid, runs the OAuth2 flow against the user’s client secret. The credentials are returned to Python and used to authorize the httplib2 client that accesses Google Drive.

import os

from oauth2client import client, tools
from oauth2client.file import Storage

try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

# Full Drive scope: the script uploads, exports, and deletes files.
SCOPES = 'https://www.googleapis.com/auth/drive'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Book Digitization'

def get_credentials():
    """Gets valid user credentials from storage.

    If nothing has been stored, or if the stored credentials are invalid,
    the OAuth2 flow is completed to obtain the new credentials.

    Returns:
        Credentials, the obtained credential.
    """
    credential_path = os.path.join("./", 'token.json')
    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        if flags:
            credentials = tools.run_flow(flow, store, flags)
        else:
            # Needed only for compatibility with Python 2.6
            credentials = tools.run(flow, store)
        print('Storing credentials to ' + credential_path)
    return credentials

A second function, convert_to_text, ingests the service information and the image file. The image is uploaded with a target mimeType of a Google Doc, which tells Drive to convert the image into a document containing the recognized text; that text is then exported as a plain-text file. The intermediate file is then deleted from the user’s Google Drive. Since the results are written directly to the local drive, convert_to_text does not return any information.

The downloads are arranged so that the text file and the image share the same name but live in separate folders. For instance, an image located at ./images/087.jpg ends up as a text file located at ./text/087.txt. Maintaining the naming convention makes each text file easier to process later.
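That path mapping can be sketched on its own with pathlib (text_path_for is my name for the illustration, not a function in either script; as_posix keeps the separators consistent across platforms):

```python
from pathlib import Path

def text_path_for(imgfile: str) -> str:
    """Map an image path such as images/087.jpg to text/087.txt,
    keeping the stem so the page numbering survives the conversion."""
    return (Path("text") / (Path(imgfile).stem + ".txt")).as_posix()

print(text_path_for("images/087.jpg"))  # text/087.txt
```

Because only the folder and extension change, sorting the text files later reproduces the page order of the original images.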

import io
import os
import pathlib

from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload

def convert_to_text(service, imgfile: str):
    imgname = os.path.basename(imgfile)
    txtfile = './text/' + pathlib.Path(imgfile).stem + '.txt'

    # Asking Drive to store the upload as a Google Doc triggers the
    # image-to-text conversion; the media itself is a JPEG.
    res = service.files().create(
        body={
            'name': imgname,
            'mimeType': 'application/vnd.google-apps.document'
        },
        media_body=MediaFileUpload(imgfile, mimetype='image/jpeg', resumable=True)
    ).execute()

    # Export the recognized text as a plain-text file.
    downloader = MediaIoBaseDownload(
        io.FileIO(txtfile, 'wb'),
        service.files().export_media(fileId=res['id'], mimeType="text/plain")
    )
    done = False
    while done is False:
        status, done = downloader.next_chunk()

    # Clean up: remove the intermediate Google Doc from Drive.
    service.files().delete(fileId=res['id']).execute()
    print(imgname + " completed...")

Further processing: Ebook

Initially, our image database contained one image of each page or surface of the book, excluding entirely blank pages. That means we also have images of the front cover (very little text), the copyright page (text with no story content), the table of contents (text with no story content), and the back cover (no text). Once the functions complete, though, we have files in plain-text format (.txt) that need to be converted to a document type capable of supporting rich text (e.g., a Word document, .docx). Although I strongly prefer LaTeX, Word was easier to work with at this point.

In order to turn the document into an ebook, though, more processing needs to be done. Pictures need to be updated and optimized (though never with AI), captions need to be added, typographical errors need to be fixed, section formatting needs to be corrected, and dropped, butchered, or otherwise mangled words need attention. This effort became a labor of love in itself, and one that took quite some time.
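Some of that cleanup lends itself to automation. A hedged sketch of the idea (the rules below are illustrative, not the exact fixes I applied) using regular expressions to repair two artifacts OCR tends to introduce:

```python
import re

def clean_page(text: str) -> str:
    """Repair common OCR artifacts in a page of extracted text."""
    # Rejoin words hyphenated across a line break: "diffi-\ncult" -> "difficult"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs left behind by column detection
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text

print(clean_page("a diffi-\ncult   passage"))  # a difficult passage
```

Rules like these handle the mechanical errors; the dropped and butchered words still required reading the text against the original pages by hand.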

Conclusion

A story like Stanley Perry’s is one of achievement, not only in enduring one of the bloodiest wars in history, but also in beating the odds of historical preservation. For a book given so little attention in its publishing and distribution, the fact that A Yank and the Australian 9th has a webpage dedicated to it 80 years after the events took place is astounding.