Let's Adore Jesus-Eucharist! | Home >> Varia >> Software Engineering
Overview of the book digitizing process
Here's an overview of the process I use (please tell me if you have a better method). Each step will be explained below:
1.1) Intellectual property. Before putting a book into the public domain, we must make sure it is no longer copyrighted, or that the owners have given their written permission.
1.2) Digital photography ("Scanning"). An electronic device converts the paper pages of the book into large files containing raw pictures of those paper pages.
1.3) Optical Character Recognition (OCR). A software program takes the raw pictures of the pages and interprets them as best it can, transforming the ink blots into letters of the alphabet.
1.4) Correction. A reviewer compares the raw image with the OCR output, and makes the necessary corrections.
1.5) HTML or other encoding. This part of the process is variable. It can include formatting of the text (italics, bold, styles, etc.), the organization of the logical elements (footnotes, table of contents, index, etc.), and the features proper to digital books (hyperlinks, animations, etc.).
Ideally, you contact the copyright holders and get their written permission. If you can't find them, or if they don't answer you, at least try to gather the "legal evidence" which might eventually help you show a court of law that you did what you could to get this permission. (See for example the Open Letter To Desclée de Brouwer.)
3.1) Get a scanner. I guess just about any device will do. My scanner (Canon Canoscan LiDE 80) was one of the simpler and less expensive models on the market, purchased for about $150 many years ago. Since then, prices have only gone down and features up, so you can't really go wrong! By the way, you can pay a lot of money for an automatic document feeder, but I'd say it's useless, since the pages of old, out-of-print books (precisely the kind of book you tend to want to digitize!) often have formats and thicknesses which are not amenable to automatic feeding.
3.2) Get scanning software. Often, it comes bundled with the scanner. That is how I got my ScanSoft Omnipage SE software, which is adequate for me. This program also does the OCR.
3.3) Prepare the book. Unfortunately, so far I've only had good results by cutting off the book's spine, which separates all the pages and lets me position them on the scanner's bed. Here's a method that works: (1) cut off the cover; (2) put the book on the edge of a workbench; (3) set a metal ruler near the book's spine (far enough from the spine to cut all the pages free, but not so far that you lop off some text); (4) clamp everything down tightly with two bar clamps; (5) cut with a utility knife (X-Acto, Olfa, etc.); (6) after scanning, you can sandwich the loose pages between the two cut covers, and hold everything together with elastics. You have to keep the original until the digitizing process is finished. Of course, if the book doesn't belong to you, don't cut it! I've scanned a book without cutting it up, but it is unpleasant work and gives poor results.
3.4) Do the actual scanning. Contrary to appearances, this is one of the fastest and most pleasant parts of the process, so enjoy it! You start the software and the scanner, and manually place the pages on the scanner's bed, one after the other. Stop every 50 pages or so and save your work, in case the software crashes (it's happened to me several times). Name the files according to the pages they contain, for example "pages 0001 to 0060.opd". See your scanner's manual for details.
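The file-naming habit described in 3.4 can be sketched in a few lines of Python. This is only an illustration: the function names `batch_filename` and `batch_names` are hypothetical, and the 50-page batch size and ".opd" extension are simply taken from the text above.

```python
def batch_filename(first_page, last_page, ext="opd"):
    """Build a name like 'pages 0001 to 0060.opd' for one saved batch."""
    # Zero-padding to 4 digits keeps the files sorted in page order.
    return f"pages {first_page:04d} to {last_page:04d}.{ext}"


def batch_names(total_pages, batch_size=50, ext="opd"):
    """List the file names needed to cover a book of total_pages pages."""
    names = []
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        last = min(first + batch_size - 1, total_pages)
        names.append(batch_filename(first, last, ext))
    return names


if __name__ == "__main__":
    for name in batch_names(120):
        print(name)
```

Zero-padded page numbers mean the files list in the right order in any file browser, which helps when a book ends up spread over dozens of saves.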
At the end of this step, some people declare victory and just plop those huge files onto the Internet, claiming that the book has now been "digitized". I think that's verbal inflation: the book has just been photographed, not really digitized. A really digitized book requires much more work, but also has many more advantages (like compactness, ease of indexing, ease of corrections and additions, etc.).
This is probably the most complicated part of the process, but fortunately this complexity is hidden inside the OCR software. All you need to do is push a button and the computer does (almost!) all the work. The software will ask questions when it is unable to recognize certain words. In the following example, the software is confused because there is no dot on the "i":
Example of an OCR error
This is the part of the process that normally stands to profit the most from more recent versions of an OCR program. At the end of the OCR step, the software produces a file with the characters it has managed to recognize.
I've never come across an OCR program that avoided all errors. I guess it might even be theoretically impossible (think about pages that are torn, or vandalized by scribbling, or think about printing errors, etc.). You have to read the text file produced by the OCR software, and compare it to the raw digital image. In practice, this step is done in two stages.
5.1) First "rough cut", more or less automatically done. After the OCR but before encoding, you use the features of your word processor to make as many "global" corrections as possible. For example, the books by Thonnard don't put the accent on the "À" which begins a sentence (something which should be done in French), so you can do a semi-automatic search-and-replace and correct all occurrences, etc.
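For readers who prefer a script to a word processor, the "À" correction from 5.1 might look something like the sketch below. It is an assumption-laden illustration, not the method I actually used: the pattern only looks for a capital "A" followed by a space at the start of a line or after sentence-ending punctuation, and the `confirm` callback (a hypothetical name) stands in for the human who accepts or rejects each candidate, which is what keeps the replacement "semi-automatic".

```python
import re

# A capital "A" followed by a space, either at the start of a line or
# right after sentence-ending punctuation and whitespace.
LEADING_A = re.compile(r"(^|[.!?]\s+)(A)(?= )", re.MULTILINE)


def fix_leading_a(text, confirm=lambda context: True):
    """Replace sentence-initial 'A' with 'À' wherever confirm() agrees.

    confirm receives a short excerpt starting at the candidate 'A' and
    returns True to accept the replacement, False to leave it alone.
    """
    def repl(match):
        start = match.start(2)
        context = text[start:start + 40]
        if confirm(context):
            return match.group(1) + "À"
        return match.group(0)

    return LEADING_A.sub(repl, text)


if __name__ == "__main__":
    sample = "A Paris, il pleut. A Rome, il fait beau."
    print(fix_leading_a(sample))
```

Passing a `confirm` function that prints the excerpt and waits for a yes/no answer turns this into an interactive review, which is safer than a blind replace since a leading "A" is not always the preposition.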
5.2) The "monastic" correction. That's the real correction, the one which is a monk's labor! I tend to proofread one paragraph at a time, and integrate this step with the following one (see #6.6 here below).
This is the longest and most difficult part of the process, since you're not just translating anymore (from a paper format to an electronic format), but producing something new (we often add things which don't even exist in the original work).
There are in theory at least three approaches to this step:
- save the OCR output as an HTML file, and fix the incorrect or missing HTML;
- save it into some ordinary word processor format (like Microsoft Word), then make changes, and finally use the ability of the word processor to save that file into an HTML format;
- save it as a plain ASCII text file, and encode the HTML from scratch.
I've tried all three approaches. As far as I know, it is currently less work to redo the HTML from scratch, if you want the result to be squeaky-clean. Of course, newer versions of software could change this conclusion in the future.
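To give an idea of what "encoding the HTML from scratch" starts from, here is a minimal sketch of the third approach in Python. It assumes the plain text file separates paragraphs with blank lines; the function name `text_to_html_paragraphs` is hypothetical, and all the real work (italics, footnotes, links) would still remain to be done afterwards.

```python
import html
import re


def text_to_html_paragraphs(text):
    """Wrap each blank-line-separated paragraph of plain text in <p> tags."""
    paragraphs = re.split(r"\n\s*\n", text)
    out = []
    for paragraph in paragraphs:
        # Collapse the OCR's hard line breaks into single spaces.
        joined = " ".join(paragraph.split())
        if not joined:
            continue
        # Escape &, < and > so the text cannot be mistaken for markup.
        out.append("<p>" + html.escape(joined, quote=False) + "</p>")
    return "\n".join(out)


if __name__ == "__main__":
    print(text_to_html_paragraphs("First line\nwraps here.\n\nTom & Jerry <3"))
```

Starting from clean `<p>` elements like these means every other tag in the file is one you added on purpose, which is the main attraction of the from-scratch approach.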
Here is what your computer screen might look like while you're doing this step:
Screen shot during encoding
If you decide to encode the book into HTML (my current recommendation), you can read a little tutorial on this topic (see for example HTML for Grannies). If you're typing every HTML tag by hand, you're probably wasting your time. A good word processor will let you select the piece of text you want to encode (for example a word you want to italicize) and hit a key combination that starts a macro to do the work for you.
Here's an overview of the steps I used to digitize the Précis de philosophie by F.-J. Thonnard (but keep in mind that this part of the process can be highly variable from one book to the next, and from one person to the next):
6.1) Tag off a paragraph. I tend to do these steps paragraph by paragraph.
6.2) Add formatting. For example, italicize and put in bold according to the original.
6.3) Add the footnotes, hyperlinks, etc. Often, the OCR software isn't able to correctly recognize footnotes. These days, I do it this way: (1) double-click on the "footnote source" template in the "palette of text snippets" (bottom left-hand corner in the screen shot above); (2) cut the footnote text and paste it in the HTML page that contains all the footnotes; (3) add the "footnote target" text snippet (called "02_texte_note" in the bottom left-hand corner of the screen shot); (4) give the correct number to the footnote; (5) come back to the first page and set the same footnote number.
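The two footnote snippets in 6.3 come down to a pair of matching anchors: one in the main text, one on the footnotes page. The sketch below shows one possible shape for them in Python. Everything in it is an assumption for illustration: the file names "notes.htm" and "index.htm", the "ref"/"note" id scheme, and the function names are hypothetical, and the actual contents of my text snippets are not reproduced here.

```python
def footnote_source(number, notes_page="notes.htm"):
    """Marker placed in the main text, linking to the note itself."""
    return (
        f'<a id="ref{number}" href="{notes_page}#note{number}">'
        f"<sup>{number}</sup></a>"
    )


def footnote_target(number, text, main_page="index.htm"):
    """Entry placed on the footnotes page, linking back to the main text."""
    return (
        f'<p id="note{number}">'
        f'<a href="{main_page}#ref{number}">{number}</a>. {text}</p>'
    )


if __name__ == "__main__":
    print(footnote_source(7))
    print(footnote_target(7, "See chapter 2."))
```

Generating both halves from the same number avoids the most common mistake in steps (4) and (5), which is a source and a target that no longer agree.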
6.4) Remake or scavenge the images, pictures, drawings, etc.
6.5) Format the HTML code (optional). You can use a macro that justifies the text. This code formatting doesn't appear when viewing the page with a web browser, but it's more polite for the other programmers who some day might have to come and modify your HTML.
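As a rough equivalent of such a macro, the sketch below re-wraps the source of one HTML paragraph to a fixed line width using Python's standard `textwrap` module. It assumes ordinary paragraph text, where runs of whitespace are collapsed by the browser anyway; it should not be applied to `<pre>` blocks or to tags whose attribute values contain spaces. The function name `wrap_html_source` and the default width of 72 are hypothetical choices.

```python
import textwrap


def wrap_html_source(paragraph, width=72):
    """Re-wrap the source of one HTML paragraph to lines of at most width.

    Only whitespace changes, so a browser renders ordinary paragraph
    text exactly as before.
    """
    one_line = " ".join(paragraph.split())
    return textwrap.fill(
        one_line,
        width=width,
        break_long_words=False,   # never split a long tag or URL
        break_on_hyphens=False,   # never split at hyphens inside words
    )


if __name__ == "__main__":
    source = "<p>" + "Lorem ipsum dolor sit amet, " * 6 + "</p>"
    print(wrap_html_source(source, width=60))
```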
6.6) Do the final "monastic" correction. Not only should you do this correction, but ideally, another person should review your work.
Tom Gilb says: "If you don't know what you're doing, don't do it on a large scale", and Jon Bentley adds: "It's faster to make a 4-inch telescope mirror, then a 6-inch mirror, than to try to make a 6-inch mirror right away".
I was lucky enough to apply that advice to book digitizing. I started with a simple 20-page book by Courtois, then graduated to a 200-pager by Sertillanges, and only then did I attack the book I really wanted to digitize, a 2000-page monster by Thonnard. I strongly recommend you follow a similar approach.