Read PDF stamp text in C#?

Matt_Hirst · 15 Jan 2020 at 14:15

I have a requirement to read certain text contained within PDF files. I created a simple app in C# using the PDFBox library, and it duly reads all of the text in the document and outputs it.

Some of the documents contain a stamp, which I assume is like a watermark. In the examples I have, the stamp is a table of info about what is contained on the PDF page, so it is made up of text. The app I have created doesn't return the text contained within the stamp.

How can I read the stamp info? The PDFBox examples I can find show how to create an overlay, but not how to read the text contained within one.

Matt

jsmoke · 15 Jan 2020 at 14:17

Got an example pdf?

Matt_Hirst · 15 Jan 2020 at 14:19

I'll try to find an example that I can share.

Matt

Matt_Hirst · 16 Jan 2020 at 10:09

I've created a basic PDF with a rectangle, a stamp and a text box on it. Ideally I need to identify the different objects on the page (can this be done?), strip the text from the PDF file, then loop through the various objects and strip any text contained within them.

I'll send you a copy if you want to take a look.

Matt

Pho · 16 Jan 2020 at 10:20

You might find iText (formerly iTextSharp) can do it. It's a fairly go-to library for both Java and C# when working with PDF files. Maybe starting with something like extracting objects from pdf may get you started.

I'm not really sure what you're asking for based on your description though.

Matt_Hirst · 16 Jan 2020 at 11:00

Hi Pho,

Thanks, yes, I looked at that but it's a paid-for licence, and I was looking for a free library to try a few things out.

We have PDFs and I can extract the text from these quite easily with PDFBox. However, some of the PDF files have other objects on them like stamps/watermarks/overlays/text boxes, and I can't seem to access these and read the text from them, and I don't know why not. That's the gist of the question.

Thanks for the links though.

Matt

Dup · 16 Jan 2020 at 14:36

Matt_Hirst said:
Hi Pho,

Thanks, yes, I looked at that but it's a paid-for licence, and I was looking for a free library to try a few things out.

We have PDFs and I can extract the text from these quite easily with PDFBox. However, some of the PDF files have other objects on them like stamps/watermarks/overlays/text boxes, and I can't seem to access these and read the text from them, and I don't know why not. That's the gist of the question.

Thanks for the links though.

Matt

Are you sure it's definitely text content and not a vector/raster image? PDFs don't always store text as text (it can be converted to vector outlines, embedded fonts removed, etc.), and text content is not always stored logically, as in the file doesn't always know what is a sentence, a paragraph and so on. I've seen characters stored individually, separate from each other, and other such weirdness in the past, so reading text from PDFs is not always reliable.

If you can open the PDF normally and select the text to copy/paste, then hopefully you should see it in a PDF-to-text tool, but if not it's likely vector. Opening it in a free PDF editor, or Illustrator if you have it, will give you an idea of how the stamp is made up.
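To make the text-versus-vector distinction concrete: in a PDF content stream, real text shows up as string operands to text-showing operators such as Tj between BT and ET, while vector artwork is only path operators (m, l, re, S, f). Below is a minimal Java sketch, using only the standard library, that lists the literal strings passed to Tj in a content stream that has already been decompressed. The class and method names are hypothetical, and it deliberately ignores TJ arrays, hex strings, nested parentheses and font encodings, so treat it as a rough check rather than a text extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or any other library.
final class ContentStreamTextCheck {

    // Matches a literal string "( ... )" directly followed by the Tj operator.
    // Escaped characters such as \) are allowed inside the string;
    // unescaped nested parentheses are not handled.
    private static final Pattern LITERAL_TJ =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj\\b");

    // Returns the raw (still escaped) literal strings shown with Tj.
    static List<String> literalStringsShown(String decodedStream) {
        List<String> shown = new ArrayList<>();
        Matcher m = LITERAL_TJ.matcher(decodedStream);
        while (m.find()) {
            shown.add(m.group(1));
        }
        return shown;
    }

    public static void main(String[] args) {
        // Made-up content streams: one with text, one with paths only.
        String stampLike = "BT /F1 12 Tf 72 700 Td (APPROVED) Tj ET";
        String vectorOnly = "0 0 100 50 re S 10 10 m 90 40 l S";
        System.out.println(literalStringsShown(stampLike));   // prints [APPROVED]
        System.out.println(literalStringsShown(vectorOnly));  // prints []
    }
}
```

An empty result on a stream that visibly draws lettering is a hint that the lettering was stored as outlines or as an image, although it could also mean the text uses one of the forms this sketch skips.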

Pho · 16 Jan 2020 at 16:04

Matt_Hirst said:
Thanks, yes, I looked at that but it's a paid-for licence, and I was looking for a free library to try a few things out.

According to GitHub it's also available under the AGPL, which is free to use, including commercially, as long as you comply with its copyleft terms:

https://tldrlegal.com/license/gnu-affero-general-public-license-v3-(agpl-3.0)

Matt_Hirst · 17 Jan 2020 at 12:53

Pho,

Thanks for that, I'll take another look.

Dup,

Opening the PDF, I can right-click the text box and it says it is a text box, but I cannot read the text from it with the reader. It's not an AcroForm either, as that shows as null.

Matt
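
A text box that is not part of an AcroForm is most likely an annotation (subtype FreeText in the PDF format), and stamps are annotations as well (subtype Stamp). PDFBox's text stripper appears to read only the page content stream and not annotations, which would explain the missing text, though that is an assumption worth verifying against the actual files. One quick way to check what a file contains is the minimal Java sketch below, which uses only the standard library to count a few annotation subtypes in the raw bytes. The class name is hypothetical, and because it scans raw bytes it will miss annotations stored inside compressed object streams.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic for illustration; not part of PDFBox.
final class AnnotationSubtypeScan {

    // Annotation subtypes relevant to stamps, text boxes, watermarks and form fields.
    private static final Pattern SUBTYPE =
            Pattern.compile("/Subtype\\s*/(Stamp|FreeText|Watermark|Widget)\\b");

    // Counts occurrences of each subtype name in the uncompressed parts of the file.
    static Map<String, Integer> countSubtypes(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data decodes safely.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = SUBTYPE.matcher(raw);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AnnotationSubtypeScan <file.pdf>");
            return;
        }
        // Prints a map of subtype name to count for the given file.
        System.out.println(countSubtypes(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

If the scan reports Stamp or FreeText entries, PDFBox can reach them through PDPage.getAnnotations(); each PDAnnotation has getContents(), and its appearance stream may hold the visible text.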