ocr - Training Tesseract - shapeclustering issue

June 15, 2011

I'm trying to train Tesseract (adding a new, digit-only font) following the instructions found here: http://code.google.com/p/tesseract-ocr/wiki/trainingtesseract3

What I've done:

I created a PDF with sample text, converted it to TIF, and ran tesseract num.dot.exp0.tif num.dot.exp0 batch.nochop makebox digits. I then edited the generated box file, correcting the wrong detections.
I ran Tesseract in training mode: tesseract num.dot.exp0.tif num.dot.exp0 nobatch box.train, and extracted the unicharset: unicharset_extractor num.dot.exp0.box
I created the font_properties file: echo "num.dot.exp0 0 0 0 0 0" > font_properties

Everything is OK so far: the .box and unicharset files are correct, and num.dot.exp0.tr was generated.

Then I ran shapeclustering -F font_properties -U unicharset num.dot.exp0.tr and got the following error:

    reading num.dot.exp0.tr ...
    *** glibc detected *** shapeclustering: double free or corruption (!prev): 0x098c52e0 ***
    ======= backtrace: =========
    /lib/i386-linux-gnu/libc.so.6(+0x75ee2)[0x82eee2]
    /usr/lib/i386-linux-gnu/libstdc++.so.6(_zdlpv+0x1f)[0x77d51f]
    /usr/lib/i386-linux-gnu/libstdc++.so.6(_zdapv+0x1b)[0x77d57b]
    shapeclustering(_zn13genericvectoriie5clearev+0x8b)[0x8050949]
    shapeclustering(_zn13genericvectoriied1ev+0x2b)[0x805056b]
    /usr/lib/libtesseract.so.3(_zn9tesseract17trainingsampleset14setupfontidmapev+0x137)[0x488699]
    /usr/lib/libtesseract.so.3(_zn9tesseract17trainingsampleset22organizebyfontandclassev+0x22)[0x48823c]
    /usr/lib/libtesseract.so.3(_zn9tesseract13mastertrainer24replacefragmentedsamplesev+0x1d7)[0x477ebd]
    /usr/lib/libtesseract.so.3(_zn9tesseract13mastertrainer15postloadcleanupev+0x47)[0x47587b]
    shapeclustering[0x804e2b9]
    shapeclustering(main+0x5f)[0x804cb13]
    /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7d24d3]
    shapeclustering[0x804ca21]
    (...)
    00cba000-00cc1000 rw-p 0039c000 08:01 4465015    /usr/lib/libtesseract.so.3.0.2
    00cc1000-00d5c000 rw-p 00000000 00:00 0
    00ef8000-00f22000 r-xp 00000000 08:01 4211867    /lib/i386-linux-gnu/libm-2.15.so
    00f22000-00f23000 r--p 00029000 08:01 4211867    /lib/i386-linux-gnu/libm-2.15.so
    00f23000-00f24000 rw-p 0002a000 08:01 4211867    /lib/i386-linux-gnu/libm-2.15.so
    08048000-08056000 r-xp 00000000 08:01 4464615    /usr/bin/shapeclustering
    08056000-08057000 r--p 0000d000 08:01 4464615    /usr/bin/shapeclustering
    08057000-08058000 rw-p 0000e000 08:01 4464615    /usr/bin/shapeclustering
    093c5000-094cf000 rw-p 00000000 00:00 0          [heap]
    b779a000-b77a0000 rw-p 00000000 00:00 0
    b77b6000-b77ba000 rw-p 00000000 00:00 0
    bfb6c000-bfb8d000 rw-p 00000000 00:00 0          [stack]
    aborted (core dumped)

After that, an empty shapetable file is created.

Have I done something wrong? Any clues as to why this is happening?

I'm using Tesseract 3.02.

I managed to find out the problem. I should have used echo "dot 0 0 0 0 0" > font_properties instead of echo "num.dot.exp0 0 0 0 0 0" > font_properties

shapeclustering worked after that. It just needs the real font name in font_properties, not the complete file name ("dot", in this case).
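To avoid repeating the mistake, the font name can be derived from the training file name instead of typed by hand. The following is only a sketch in POSIX shell, assuming the training files follow the <lang>.<fontname>.exp<N>.tr naming pattern used above. The helpers font_name_from_tr and has_font_entry are hypothetical names made up for this example; they are not part of Tesseract. The five zeros are the italic, bold, fixed, serif and fraktur flags, all left off as in the original command.

```shell
#!/bin/sh
# Hypothetical helper: print the font name of a training file named
# <lang>.<fontname>.exp<N>.tr (assumed naming pattern).
font_name_from_tr() {
    base="${1##*/}"                 # drop any directory part
    base="${base%.tr}"              # num.dot.exp0
    rest="${base#*.}"               # dot.exp0
    printf '%s\n' "${rest%.exp*}"   # dot
}

# Hypothetical helper: succeed only if the given font_properties file
# has a line whose first field is exactly the font name.
has_font_entry() {
    awk -v f="$1" '$1 == f { found = 1 } END { exit !found }' "$2"
}

font="$(font_name_from_tr num.dot.exp0.tr)"

# Write one entry: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
printf '%s 0 0 0 0 0\n' "$font" > font_properties

# Pre-flight check before running shapeclustering.
if has_font_entry "$font" font_properties; then
    echo "ok: '$font' is listed in font_properties"
else
    echo "missing: no entry for '$font' in font_properties" >&2
fi
```

With several fonts, the same check could be run in a loop over all *.tr files before calling shapeclustering, so that a missing or misnamed entry is reported up front rather than surfacing as a crash.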
