EzDevInfo.com

OCR interview questions

Top OCR frequently asked interview questions

OCR Image recognition plugin for firefox and chrome? [closed]

Sometimes I need to OCR images that I stumble upon on webpages. Are there any plugins for Firefox or Chrome that would let me upload the desired image to a server for processing, recognize it, and send me back the result?


Source: (StackOverflow)

Optimal lossy compression of images

I have a lot of images/documents where I want quite a low file size without throwing away too much information or causing generation loss with future compression.

A) Documents, business cards, etc., where I want to be able to read or OCR the text. Some information, like color resolution, may be less important.

B) Regular photos (holidays, etc.), where I want to preserve sufficient detail. (The file size from the camera is probably much larger than needed.)

Are there any tools that can assist me with doing this?

(29 April) This is more about finding the optimal size/loss trade-off than about which technique is most efficient.


Source: (StackOverflow)


Image to Text converter [closed]

I need software that can convert scanned text to editable text. I would prefer freeware.


Source: (StackOverflow)

How to create documents from scanned images

I have a large number of Microsoft Word documents to recreate after a disk crash and patchy backups destroyed the originals.

We have a fair number of the originals left, and instead of manually recreating them, I want to scan them, use something to capture the image and convert it to Word 2007 if possible, and then polish the final document.

Possibly use OCR scanning software?

Is there any way to do it?


Source: (StackOverflow)

Is there a utility to do OCR on images on the windows clipboard? [closed]

Sometimes, I find myself typing out a lot of text from a screen capture. It's quite tedious.

Is there an OCR (Optical Character Recognition) program out there that would let me quickly convert something like a screen capture, or the contents of the Windows clipboard (a bitmap), into text?


Source: (StackOverflow)

Command-line OCR in Windows 7

What are some command-line OCR utilities that will work in Windows 7 64-bit?


Source: (StackOverflow)

How to extract text with OCR from a PDF on Linux?

How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.
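Since Okular can't do it, one workable approach on Linux is to render the pages with pdftoppm (from poppler-utils) and feed each page image to Tesseract. A minimal sketch, assuming both tools are installed; the file names, DPI value, and working directory are placeholders:

```python
import subprocess
from pathlib import Path

def page_number(image_path):
    # "pages/page-10.png" -> 10, so pages sort numerically, not lexically
    return int(Path(image_path).stem.rsplit("-", 1)[-1])

def ocr_pdf_to_text(pdf_path, out_txt, dpi=300, workdir="pages"):
    """Render each PDF page as a PNG with pdftoppm, then OCR each page
    with Tesseract and concatenate the plain-text results."""
    Path(workdir).mkdir(exist_ok=True)
    # pdftoppm writes pages/page-1.png, pages/page-2.png, ...
    subprocess.run(["pdftoppm", "-r", str(dpi), "-png",
                    pdf_path, f"{workdir}/page"], check=True)
    with open(out_txt, "w") as out:
        for img in sorted(Path(workdir).glob("page-*.png"), key=page_number):
            # "stdout" as the output base makes tesseract print plain text
            result = subprocess.run(["tesseract", str(img), "stdout"],
                                    capture_output=True, text=True,
                                    check=True)
            out.write(result.stdout)
```

The numeric sort key matters: without it, page-10 would be OCRed before page-2.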


Source: (StackOverflow)

How can I convert scanned images as PDF to a searchable PDF file? [closed]

I have a PDF of a scanned book.

I'm looking for free software that will perform OCR and then provide an option to save the result as a PDF or document again.

Is there one?


Source: (StackOverflow)

Batch-OCR many PDFs

This was discussed a year ago here:

Batch OCR for many PDF files (not already OCRed)?

Is there any way to batch OCR PDFs that haven't already been OCRed? This is, I think, the current state of things, which involves two issues:

Batch OCR PDFs

Windows

  • Acrobat – This is the most straightforward OCR engine that will batch OCR. The only problems seem to be that 1) it won't skip files that have already been OCRed, and 2) if you throw a bunch of PDFs at it (some old), it will crash. It is a little buggy. It will warn you at each error it runs into (though you can tell the software not to notify you). But again, it dies horribly on certain types of PDFs, so your mileage may vary.

  • ABBYY FineReader (Batch/ScanSnap), OmniPage – These have got to be some of the worst-programmed pieces of software known to man. If you can find out how to fully automate (no prompting) batch OCR of PDFs, saving with the same name, then please post here. The only solutions I could find failed somewhere: renaming, not fully automated, etc. At best there is a way to do it, but the documentation and programming are so horrible that you'll never find out.

  • ABBYY FineReader Engine, ABBYY Recognition Server – These really are more enterprise solutions. You would probably be better off just getting Acrobat to run over a folder and weeding out the PDFs that give you errors or crash the program than going through the hassle of installing evaluation software (assuming you are a simple end user). They don't seem cost-competitive for the small user.

  • Autobahn DX Workstation – The cost of this product is so prohibitive that you could probably buy six copies of Acrobat. Not really an end-user solution. If you're an enterprise setup, this may be worth it for you.

Linux

  • WatchOCR – no longer developed, and basically impossible to run on modern Ubuntu distros
  • pdfsandwich – no longer developed, and basically impossible to run on modern Ubuntu distros
  • ABBYY Linux OCR – this should be scriptable, and seems to have some good results:

http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

However, like a lot of the other ABBYY products, they charge by the page. Again, you might be better off trying to get Acrobat's batch OCR to work.

  • Ocrad, GOCR, OCRopus, Tesseract – These may work, but there are a few problems:

    1. OCR results are not as good as, say, Acrobat's for some of these (see the link above).
    2. None of these programs take in a PDF file and output a PDF file. You have to write a script that breaks the PDF apart, runs the program over each page, and then reassembles the pages into a PDF.
    3. Once you do, you may find, as I did, that Tesseract creates an OCR layer that is shifted over. So if you search for the word 'the', you'll get a highlight of the part of the word next to it.
  • Batch DjVu → convert to PDF – I haven't looked into it, but it seems like a horribly roundabout solution.
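The break-apart/OCR/reassemble routine described in point 2 above can be scripted. A sketch, assuming pdftoppm and pdfunite (both from poppler-utils) plus a Tesseract new enough to have the "pdf" output mode are on the PATH; the paths and DPI value are placeholders:

```python
import subprocess
from pathlib import Path

def page_key(image_path):
    # numeric sort so page-10 lands after page-9, not after page-1
    return int(Path(image_path).stem.rsplit("-", 1)[-1])

def make_searchable_pdf(pdf_in, pdf_out, dpi=300, workdir="pages"):
    """Break the PDF into page images, OCR each into a one-page searchable
    PDF with Tesseract's 'pdf' output mode, then stitch them back together."""
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    subprocess.run(["pdftoppm", "-r", str(dpi), "-png",
                    pdf_in, str(work / "page")], check=True)
    pages = sorted(work.glob("page-*.png"), key=page_key)
    for img in pages:
        # "pdf" tells tesseract to emit <base>.pdf with an invisible text layer
        subprocess.run(["tesseract", str(img), str(img.with_suffix("")),
                        "pdf"], check=True)
    page_pdfs = [str(img.with_suffix(".pdf")) for img in pages]
    subprocess.run(["pdfunite", *page_pdfs, pdf_out], check=True)
```

This sidesteps the "no PDF in, PDF out" complaint, though the shifted-text-layer caveat with Tesseract still applies.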

Online

  • PDFcubed.com – Come on, not really a batch solution.
  • ABBYY Cloud OCR – Not sure whether this is really a batch solution. Either way, you have to pay by the page, and this could get quite pricey.

Identifying non-OCRed PDFs

This is a slightly easier problem that can be solved easily on Linux and much less easily on Windows. I was able to write a Perl script using pdffonts to check whether fonts are embedded, and thereby determine which files are not OCRed.


Current "solutions"

  1. Use a script to identify non-OCRed PDFs (so you don't rerun over thousands of already-OCRed PDFs), copy these to a temporary directory (retaining the correct directory tree), and then use Acrobat on Windows to run over them, hoping that the smaller batches won't crash.

  2. Use the same script, but get one of the Linux OCR tools to work properly, risking OCR quality.

I think I'm going to try #1. I'm just worried about the results of the Linux OCR tools (I don't suppose anyone has done a comparison), and breaking the files apart and stitching them together again seems like unnecessary coding if Adobe can actually batch OCR a directory without choking.

If you want a completely free solution, you'll have to use a script to identify the non-OCRed PDFs (or just rerun over OCRed ones), and then use one of the Linux tools to try to OCR them. Tesseract seems to have the best results, but again, some of these tools are not well supported on modern versions of Ubuntu. If you can set one up and fix the problem I had, where the image layer doesn't match the text layer (with Tesseract), then you would have a pretty workable solution, and once again Linux > Windows.


Do you have a working solution to fully automate batch OCR of PDFs, skipping already-OCRed files, keeping the same name, with high quality? If so, I would really appreciate the input.


Here is a Perl script to move non-OCRed files to a temp directory. I can't guarantee it works, and it probably needs to be rewritten, but if someone makes it work (assuming it doesn't) or makes it work better, let me know and I'll post a better version here.


#!/usr/bin/perl

# move non-OCRed files to a directory
# change the variables below: you need a base dir (like /home/joe/), a source
# directory and an output directory (e.g. books and tempdir)
# move all your pdfs to the source directory

use warnings;
use strict;

# need to install these modules with CPAN or your distros installer (e.g. apt-get)
use CAM::PDF;
use File::Find;
use File::Basename;
use File::Copy;

#use PDF::OCR2;
#$PDF::OCR2::CHECK_PDF   = 1;
#$PDF::OCR2::REPAIR_XREF = 1;

my $basedir = '/your/base/directory';
my $sourcedirectory  = $basedir.'/books/';
my @exts       = qw(.pdf);
my $count      = 0;
my $outputroot = $basedir.'/tempdir/';
open( WRITE, '>>', $basedir . '/errors.txt' ) or die "Cannot open error log: $!";

#check file
#my $pdf = PDF::OCR2->new($basedir.'/tempfile.pdf');
#print $pdf->page(10)->text;



find(
    {
        wanted => \&process_file,

        #       no_chdir => 1
    },
    $sourcedirectory
);
close(WRITE);

sub process_file {
    #must be a file
    if ( -f $_ ) {
        my $file = $_;
        #must be a pdf
        # fileparse returns (filename, directory, suffix) in that order
        my ( $name, $dir, $ext ) = fileparse( $_, @exts );
        if ( $ext eq '.pdf' ) {
            #check if pdf is ocred
            my $command = "pdffonts \'$file\'";
            my $output  = `$command`;
            if ( !( $output =~ /yes/ || $output =~ /no/ ) ) {
                #print "$file - Not OCRed\n";
                my $currentdir = $File::Find::dir;
                if ( $currentdir =~ /$sourcedirectory(.+)/ ) {
                    # if the output directory doesn't exist, create it
                    unless ( -d $outputroot . $1 ) {
                        system("mkdir -p \"$outputroot$1\"");
                    }
                    #copy over file
                    my $fromfile = "$currentdir/$file";
                    my $tofile = "$outputroot$1/$file";
                    print "copy from: $fromfile\n";
                    print "copy to: $tofile\n";
                    copy($fromfile, $tofile) or die "Copy failed: $!";
#                       `touch $outputroot$1/\'$file\'`;
                }
            }

        }

    }
}

Source: (StackOverflow)

Extract OCR text from Evernote

Evernote does OCR on the images you save to it. Is there a way to get the full-text equivalent of an image in Evernote, or is the OCR only for searching?


Source: (StackOverflow)

How can I identify fonts from an image? [closed]

Many times I come across bitmaps containing nothing but text paragraphs, so I was looking for a way to identify the font used, the paragraph alignment, line spacing, color, bold, and italics.

Would an OCR package allow me to do that?

If not, what other options do I have?


Source: (StackOverflow)

Is there a better way to correct errors in Adobe Acrobat's OCR results?

I'm using OCR Text Recognition integrated into Adobe Acrobat Pro 8 to produce an (invisible) searchable text overlay for text pages I have scanned. This is very neat for copying some phrases to the clipboard or for text searching.

In some cases, Adobe does a rather poor job, and in others it just produces a few typos, making the corresponding words or sentences unsearchable. In the Adobe forums, user strontium87 explains that you can manually show the text and then modify it with the Touchup Text tool before setting it to invisible again. Since this method is quite cumbersome, does anyone know of an easier way to do this? Maybe with an external tool?


Source: (StackOverflow)

How to replace images of text in PDFs with formatted text using OCR

I get a lot of PDFs from other people consisting of scanned old documents. Unfortunately, sometimes the text on the scans, though legible, looks grainy and is hard to read.

What I've been able to do so far is to extract the text, using OCR, into a Word document. However, since these old documents often have illustrations and intricate formatting, what I'd really like to be able to do is to just remove the old grainy text and substitute computer-generated fonts. In other words, I'd like to preserve the PDF and the formatting of its pages to the greatest extent possible while "cleaning up" the text by replacing it with, say, Times New Roman.

I've been looking online for a few days for a simple, automatable way to perform such a cleanup, and I haven't turned up anything so far. It definitely seems like there should be a way to do this; it doesn't seem that complicated. But maybe I'm overlooking some aspects of this problem that place it outside what is currently doable with OCR.

Any suggestions?


Source: (StackOverflow)

OCR Tesseract, Empty page error?

I compiled Tesseract from source with Leptonica. The input is a PNG image with a transparent background; I edited it, adding a blue color, and still get this error:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Empty page!!
Empty page!!

Here's the input image: [image link missing]


Source: (StackOverflow)

How to automatically find non-searchable PDFs

Suppose I have a directory full of many PDFs. In most of them, the text is completely searchable, which is the way I need them to be. But a few of them are just image scans, and they need to be OCRed.

Other than simply doing a batch OCR on the entire directory, is there a way to quickly identify which PDFs are the image-only ones that actually need to be OCRed?

I'm not a programmer, but a Linux-friendly solution would be preferred.
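One Linux-friendly approach, for what it's worth: pdffonts from poppler-utils lists the fonts a PDF uses, and an image-only scan typically lists none. A small sketch along those lines (the directory path is a placeholder):

```python
import subprocess
from pathlib import Path

def is_image_only(pdf_path):
    """pdffonts prints a two-line header, then one line per font; a scanned
    PDF with no text layer lists no fonts at all."""
    out = subprocess.run(["pdffonts", str(pdf_path)],
                         capture_output=True, text=True).stdout
    return len(out.strip().splitlines()) <= 2

def pdfs_needing_ocr(directory):
    # recurse through the directory and keep only the image-only PDFs
    return [p for p in Path(directory).rglob("*.pdf") if is_image_only(p)]
```

The header-counting heuristic can misfire on PDFs that mix scanned and text pages, so treat the result as a candidate list rather than a verdict.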


Source: (StackOverflow)