Adobe - Optical Character Recognition & Exporting to Excel

How to Guides

When you create an Acrobat document (PDF) from Word, the actual text of what was written is stored in that document. This allows the user to select text so it can be copied and pasted into another document. It also allows newer versions of Microsoft Word to open that file so you can edit it. (Side note: while this works, Word doesn’t do a great job of preserving the formatting.) However, when paper documents are scanned into a computer, the PDF file is created from an image, without the actual text.

Without the embedded text, you can’t copy and paste, you can’t open the file in Word for editing and, perhaps most importantly, you can’t search the document. Fortunately, Acrobat can analyze the picture, recognize the text and add it to the document.

Here’s how…

We’ll use a sample document that was printed out and scanned in. The PDF document has been embedded so you can try it yourself.

adobe-sample.pdf

Open the file and you’ll see the chart of numbers below. If you try to select the text, you’ll find that you can’t.

{slider Optical Character Recognition (OCR)}

Scan Text with OCR (Optical Character Recognition)

Click Tools and then Scan & OCR

adobe image2

Now click Recognize Text and then In this File:

adobe image3

Then click the Recognize Text button

adobe image4

Acrobat will then recognize the text. Try highlighting, copying and pasting the text. You’ll see that now, you can!

Since this is a chart of numbers you might want to paste it into Excel.

Go ahead and try it. You’ll get something that looks like the picture below, which isn’t all that useful.

adobe image5

{slider Export Data into Excel}

Export Data into Excel

Fortunately, Acrobat has a better way to get your document into Excel.

In Acrobat, if you choose File > Export To > Spreadsheet > Microsoft Excel Workbook, you’ll be able to save the contents as a spreadsheet and open it in Excel.

As you can see, the result is much more useful:

adobe image6

One caveat is that this process isn’t perfect. Acrobat is “reading” the text by looking at the lines in the document. If the scan is unclear, smudged, watermarked or distorted in any way, the software can read the document incorrectly. When working with text, this typically isn’t a big problem because spell check will catch most of the errors. However, when working with numbers, it’s important to verify that every value came through correctly.
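If you have a lot of numbers to verify, a short script can point you to the cells most likely to be wrong. The sketch below is only an illustration and is not part of Acrobat or Excel. It assumes you have saved the exported spreadsheet as a CSV file from Excel, and it flags any cell that mixes digits with letters that OCR often confuses with digits (such as the letter O for zero). The function name `find_suspect_cells` and the list of look-alike letters are hypothetical choices made for this example.

```python
import csv
import io

# Letters that OCR commonly mistakes for digits.
# This set is an illustrative assumption, not an exhaustive list.
LOOKALIKES = set("OoIlSB")


def find_suspect_cells(csv_text):
    """Return (row, column, value) for cells that mix digits with look-alike letters."""
    suspects = []
    for r, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        for c, cell in enumerate(row, start=1):
            value = cell.strip()
            has_digit = any(ch.isdigit() for ch in value)
            has_lookalike = any(ch in LOOKALIKES for ch in value)
            # A cell such as "1O5" is probably a misread number.
            if has_digit and has_lookalike:
                suspects.append((r, c, value))
    return suspects


# Made-up sample data standing in for an exported chart of numbers.
sample = "Item,Q1,Q2\nWidgets,1O5,230\nGadgets,98,4l2\n"
for row, col, value in find_suspect_cells(sample):
    print(f"row {row}, column {col}: {value!r}")
```

A check like this only catches letters that slipped into numbers. It can’t tell when one digit was misread as another (a 3 read as an 8, for example), so comparing totals against the original paper document is still worthwhile.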

{/sliders}

