Extract text from PDF document based on position (C++)

I am trying to extract text from a PDF document based on its coordinates, and I have come across two notions in the Adobe PDF Reference (chap. 5.3):

text positioning operators
text showing operators

For now I am interested in the Td and Tm positioning operators. When using Td I have tx and ty, relative to the start of the current line, specified in the PDF document as: tx ty Td. I have used this method to extract the text together with its tx and ty coordinates. The problem is that I don't know how to extract text from the PDF based on position, i.e. by supplying tx and ty.

a b c d e f Tm

This is the 'formula' for Tm usage. What do the a-f values represent? This is an input for Tm:

BT /f1 8.88 Tf 0 0 0 rg 0.9998 0 0 1 401.52 448.08 Tm [<0014>-11<0015>-11<0013>-11<000f>-19<0014>-11<0019>] TJ ET

Why does each group of 4 have a leading 00? Is it in hex? Should I convert the hex to an int and then to the corresponding character?

This is an input for Td:

BT 43.20 421.90 Td 0 Tw /c001 10.00 Tf 0.00 Tw <blablablatextinhexthaticanprocess>Tj ET

This one is clearer, and so are the coordinates. How can I extract the text of a Tm-positioned PDF text object based on simple x and y coordinates? I am using C++ and the PoDoFo library.

First of all, when trying to extract text from a PDF based on position by supplying tx and ty, it does not suffice to consider only the text matrix (which is set using the Tm operator you found). You have to consider the current transformation matrix, too!

I assume that when you refer to a position, you mean one given in default user space coordinates.

To avoid the device-dependent effects of specifying objects in device space, PDF defines a device-independent coordinate system that always bears the same relationship to the current page, regardless of the output device on which printing or displaying occurs. This device-independent coordinate system is called user space.

The user space coordinate system shall be initialized to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward

(section 8.3.2.3, ISO 32000-1:2008)

As you can see, x and y coordinates can be regarded as a position vector (x, y) in R². Internally, though, PDF considers this plane embedded in R³ at a constant z coordinate value of 1, i.e. as [x, y, 1]. This is because PDF on one hand wants to allow numerous kinds of transformations (translations, rotations, scaling, skewing, ...) but on the other hand wants to limit the required mathematical operations as far as possible. Incidentally, after embedding our plane as [x, y, 1] in R³, all these transformations become possible by means of matrix multiplications:

[x', y', 1] = [x, y, 1] x [[a, b, 0], [c, d, 0], [e, f, 1]]

i.e. x' = a*x + c*y + e and y' = b*x + d*y + f.

Here you see the numbers a, b, c, d, e, and f you asked about.

Now, before taking text-specific transformations into account, you have to take into account manipulations of the current (text-independent) transformation matrix. This matrix is manipulated by cm operators:

a b c d e f cm: Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(section 8.4.4, ISO 32000-1:2008)

This implies, btw, that you have to consider all cm operators in effect, i.e. all those encountered since the start of the page content, with the exception of those revoked by restoring a former graphics state (cf. the operators q and Q for pushing and restoring graphics states, section 8.4.2, ISO 32000-1:2008).

Only then can you consider the text-specific transformation matrices:

At the beginning of a text object, Tm shall be the identity matrix; therefore, the origin of text space shall be initially the same as that of user space. The text-positioning operators, described in Table 108, alter Tm and thereby control the placement of glyphs that are subsequently painted. Also, the text-showing operators, described in Table 109, update Tm (by altering its e and f translation components) to take into account the horizontal or vertical displacement of each glyph painted as well as any character or word-spacing parameters in the text state.

Additionally, within a text object, a conforming reader shall keep track of a text line matrix, Tlm, which captures the value of Tm at the beginning of a line of text. The text-positioning and text-showing operators shall read and set Tlm on specific occasions mentioned in Tables 108 and 109.

(section 9.4.2, ISO 32000-1:2008)

Thus, inside a text object you have to keep track of the text matrix, which is set using the Tm operator you found (with its operands arranged in a matrix as shown above) and which is also manipulated as an effect of the other text-positioning and text-showing operators.

And there still are additional parameters determining the final position of text, namely the text state parameters Tfs (the text font size), Th (the horizontal scaling), and Trise (the text rise), cf. section 9.3.1, ISO 32000-1:2008.

Conceptually, the entire transformation from text space to device space [or in your case to default user space] may be represented by a text rendering matrix, Trm:

Trm = [[Tfs x Th, 0, 0], [0, Tfs, 0], [0, Trise, 1]] x Tm x CTM

Trm is a temporary matrix; conceptually, it is recomputed before each glyph is painted during a text-showing operation.

(section 9.4.2, ISO 32000-1:2008)

Thus, your coordinates (x, y) conceptually result from text space coordinates by multiplication with Trm:

[x, y, 1] = [xts, yts, 1] x Trm

where (xts, yts) is (0, 0) at the glyph's origin. For every glyph printed you have a glyph displacement pointing to where the next glyph origin is positioned:

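tx = ((w0 - Tj / 1000) x Tfs + Tc + Tw) x Th
ty = (w1 - Tj / 1000) x Tfs + Tc + Tw

(Tc is the character spacing, Tw the word spacing, and Tj a position adjustment number from a TJ array, if any.)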

The text matrix shall then be updated with these glyph displacement values as follows:

Tm = [[1, 0, 0], [0, 1, 0], [tx, ty, 1]] x Tm

(section 9.4.4, ISO 32000-1:2008)

I quoted a number of paragraphs from the current PDF specification, ISO 32000-1:2008. I gather this is preferable to using the PDF Reference 1.4, as the latter is quite ancient; furthermore, it has been called "not normative in nature" by Adobe personnel.

EDIT: Some clarifications in answer to comments

Device space and user space, the distinction between them: isn't device space referring to the printer / video display? And user space a way of overcoming every device's particularities? User space being the document page that I see?

Yes, device space is the fixed coordinate system determined by the properties of the device at hand. And yes, user space is a coordinate system independent of the target device. But no, it is not "the document page that I see", because what you see is on a device (or after being processed by a device).

The user space coordinate system is an independent coordinate system; the coordinates of any point in it can be translated to device coordinates by means of a matrix multiplication with the current transformation matrix (CTM).

userCoords x CTM = deviceCoords

The user space coordinate system is initialized to a state in which the CropBox entry in the page dictionary specifies the rectangle of user space corresponding to the visible area (see above), by initializing the CTM accordingly.

But as the choice of words indicates ("current transformation matrix", "the coordinate system is initialized"), the user space coordinate system is a dynamic, ever-changing coordinate system.

The default user space provides a consistent, dependable starting place for PDF page descriptions regardless of the output device used. If necessary, a PDF content stream may modify user space to be more suitable to its needs by applying the coordinate transformation operator, cm (see 8.4.4, "Graphics State Operators"). Thus, what may appear to be absolute coordinates in a content stream are not absolute with respect to the current page because they are expressed in a coordinate system that may slide around and shrink or expand. Coordinate system transformation not only enhances device-independence but is a useful tool in its own right.

(section 8.3.2.3, ISO 32000-1:2008)

Thus, when a PDF reader stumbles upon a cm operator with parameters representing a matrix M, the CTM changes:

CTMnew = M x CTMold

and the coordinates present in the following operators are interpreted according to the new matrix CTMnew:

userCoords x CTMnew = deviceCoords

So the user space coordinate system might be quite different from its former state: scaled, rotated, skewed, whatever.

The coordinates you are interested in are those in the coordinate system that user space is initialized as, i.e. the device coordinate system of a virtual device for which the CTM is initialized as the identity matrix.

Where do text space and glyph space start and end?

The coordinates of text shall be specified in text space. The transformation from text space to user space shall be defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

The text matrix Tm is initialized as the identity matrix at the start of a text object but changes during the execution of text operations: visibly when you use the Tm operator, implicitly when you use others. This matrix is combined with a matrix Tr containing the text-related parameters font size, horizontal scaling, and text rise. For details see the text rendering matrix Trm above. Thus,

deviceCoords = userCoords x CTM = textCoords x Tr x Tm x CTM

The transformation from glyph space to text space shall be defined by the font matrix. For most types of fonts, this matrix shall be predefined to map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, the font matrix shall be given explicitly in the font dictionary (see 9.6.5, "Type 3 Fonts").

Thus, this transformation depends on the current font. With the font matrix FM from the font dictionary it acts like this:

deviceCoords = glyphCoords x FM x Tr x Tm x CTM

You most likely do not want to locate the device coordinates of a single segment of a glyph, so these coordinates do not seem to be of interest to you. Glyph widths, though, are also interpreted in glyph space. Unless you are dealing with Type 3 fonts, though, this merely means that you have to divide them by 1000...

And how do the parameters w0 and w1 evolve during glyph painting? (0,0)

w0 and w1 denote the glyph's horizontal and vertical displacements. In horizontal writing mode, w0 is the glyph width transformed to text space (i.e. merely divided by 1000) and w1 is 0. For text in vertical writing mode, inspect sections 9.2.4 and 9.7.4.3 in ISO 32000-1:2008.

Does text space have the same origin as the first glyph space? And is it updated by the calculated (tx, ty)?

As glyph space coordinates are merely multiplied by the font matrix to result in text space coordinates, and the font matrix in all cases but Type 3 fonts merely compresses by a factor of 1000 (see above), the glyph origin is mapped to the text space origin.

But tx and ty are used to update the text matrix itself. Thus, the text space coordinate system moves with each glyph, and each (non-Type 3) glyph origin maps to the origin... of the changed text space coordinate system.
