<div dir="ltr">Have you seen the <a href="https://github.com/danielquinn/paperless">Paperless</a> project?  It uses Tesseract OCR as well.  There&#39;s another project called Paperless on GitHub for Mac OS X, too.</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jul 19, 2016 at 5:05 PM, 4kbytes <span dir="ltr">&lt;<a href="mailto:4kbytes@zoho.com" target="_blank">4kbytes@zoho.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>

<br>

I am currently working with Tesseract OCR. Tesseract&#39;s development is sponsored by Google, and it is released under the Apache 2.0 license.<br>

<br>

The issue I am running into is text accuracy.<br>

<br>

My current process: convert the text to black and the background to white, maximize the contrast, then pass the image to the OCR engine.<br>

<br>
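As a rough sketch of the black-on-white step in that process, assuming the page is already loaded as rows of 8-bit grey values with dark text on a light background. The names `otsu_threshold` and `binarize` are made up for illustration, and Otsu&#39;s method is only one possible way to pick the cut-off, not necessarily the one used here:<br>

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that best splits pixels into dark and light."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner dark/light split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image):
    """Map every pixel to pure black (0) or pure white (255)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[255 if value > threshold else 0 for value in row] for row in image]


# Hypothetical 2x3 sample: two dark "ink" pixels on a light background.
sample = [[200, 210, 40],
          [35, 220, 205]]
print(binarize(sample))
```

The binarized rows would then be written back to an image file and handed to the OCR engine.<br>

<br>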

With documents from modern word processors, this approach is accurate 98% of the time. When trying to read commercial serial numbers or IDs, which can be very compact, the result has the right number of characters but the wrong characters.<br>

<br>

Has anyone worked with this system before and found a possible solution? I am currently looking into ImageMagick.<br>

<br>

_______________________________________________<br>

gnhlug-discuss mailing list<br>

<a href="mailto:gnhlug-discuss@mail.gnhlug.org">gnhlug-discuss@mail.gnhlug.org</a><br>

<a href="http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/" rel="noreferrer" target="_blank">http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/</a><br>

</blockquote></div><br></div>