Method for digitizing documents as seen in the videos above
The first step in digitizing the documents was to convert all photos into high-contrast black-and-white scans, which increases the machine-readability of the text they contain. To represent the process visually, the Android application CamScanner was used. However, since mobile resolutions do not translate well to the big screen, and since the screen-capture software ecosystem is more mature on desktop operating systems, the process was carried out on a Windows computer running an Android emulator. In essence, an Android application was used on the computer, and the process was screen-captured in high quality. The process involved manually marking the edges of each document and having the software separate it from its background and color-correct it accordingly.
After this, a Node.js script was put together to extract the text of these high-contrast scans locally. The goal was to eliminate the need for cloud processing, so that the documents stay private during analysis. To perform the OCR (optical character recognition), the popular OCR framework Tesseract was used. The script first looked at the contents of a folder named “data”, located alongside the code. It then listed all the images and randomized their order to add some unpredictable spontaneity to the process, after which it processed the images one by one, detecting lines, words, and individual characters. Since the algorithm only detects characters and has no knowledge of the correct spelling of words, some phrases may be rendered as nonsense. To mitigate these small errors in the detection of individual characters, the detected words were run through a spell-checker that corrects for misinterpreted characters. The spell-checked words were then joined together to reconstruct the paragraphs as they appear in the original document. The extracted text was then saved to a text file with the same name as the corresponding original document.
After hours of algorithm analysis, the script produced the intended results, with some exceptions. Since the documents are typewritten (analog) and therefore have inconsistent word and character spacing, the script sometimes split words or interpreted mid-word line breaks as two distinct words. These inaccuracies are simple to detect and fix using a word processor, and we experimented with this; in the end, delivering the imperfect 'testimony' of the algorithmic image processing was the most interesting option, as it became, in a sense, a nonhuman testimony.
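To give a sense of the kind of cleanup we experimented with (and ultimately chose not to apply), one way to automatically rejoin words split by a mid-word line break is sketched below. The dictionary check is purely illustrative; a real pass would consult a spell-checker's word list.

```javascript
// Rejoin words that a line break split in two. `isWord` is a placeholder
// for a dictionary lookup supplied by a spell-checker.
function rejoinLineBreaks(text, isWord) {
  // Words hyphenated across a line break ("docu-\nment") are joined directly.
  let out = text.replace(/-\n/g, '');
  // For a bare line break, join the two halves only if the result is a
  // known word; otherwise keep them as separate words.
  out = out.replace(/(\w+)\n(\w+)/g, (match, a, b) =>
    isWord(a + b) ? a + b : `${a} ${b}`);
  return out;
}
```

Applied to a fragment like `'docu\nment testi-\nmony'` with a dictionary that knows "document", this yields `'document testimony'`.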
To represent the process of character recognition visually, the script continuously announces its progress and displays the resulting text before saving it to a text file and moving on to the next document.
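Such progress announcements can be formatted by a small helper like the one below; the wiring to tesseract.js via a `logger` callback is an assumption about the library version used.

```javascript
// Format a progress update such as the ones a recognition engine emits,
// e.g. a status string plus a completion fraction between 0 and 1.
function formatProgress(status, progress) {
  return `${status}: ${Math.round(progress * 100)}%`;
}

// Assumed wiring (tesseract.js accepts a logger callback in its options):
// logger: m => console.log(formatProgress(m.status, m.progress))
```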