Read PDF stamp text info in C#?

Associate
Joined
27 Jan 2005
Posts
1,308
Location
S. Yorks
I have a requirement to read certain text contained within PDF files. I created a simple app in C# using the PDFBox library, and it duly reads all of the text in the document and outputs it.

Now some of the documents contain a stamp, which I assume is like a watermark. In the examples I have, the stamp is a table of info about what is on the PDF page, so it is made up of text. The app I have created doesn't return the text contained within the stamp.

How can I read the stamp info? The PDFBox examples I can find show how to create an overlay, but not how to read the text contained within one.

Matt
 
Associate
OP
Joined
27 Jan 2005
Posts
1,308
Location
S. Yorks
I've created a basic PDF with a rectangle, a stamp and a text box on it. Ideally I need to identify the different objects on the page (can this be done?), strip the text from the PDF file, then loop through the various objects and strip any text contained within them.

I'll send you a copy if you want to take a look.


Matt
 

Pho

Soldato
Joined
18 Oct 2002
Posts
9,324
Location
Derbyshire
You might find iText (formerly iTextSharp) can do it. It's a fairly go-to library for both Java and C# when working with PDF files. Maybe starting with something like extracting objects from a PDF may get you started.

I'm not really sure what you're asking for based on your description though.
 
Associate
OP
Joined
27 Jan 2005
Posts
1,308
Location
S. Yorks
Hi Pho,

Thanks, yes, I looked at that, but it's a paid-for licence; I was looking for a free library to try a few things out.

We have PDFs and I can extract the text from these quite easily with PDFBox. However, some of the PDF files have other objects on them, like stamps/watermarks/overlays/text boxes, and I can't seem to access these and read the text from them, and I don't know why not. That's the gist of the question.

Thanks for the links though.

Matt
 

Dup

Soldato
Joined
10 Mar 2006
Posts
11,225
Location
East Lancs
Are you sure it's definitely text content and not a vector/raster image? PDFs don't always store text as text (it can be converted to vector outlines, embedded fonts removed, etc.), and text content is not always stored logically, as in the file doesn't always know what is a sentence, paragraph, etc. I've seen characters stored individually from each other and other such weirdness in the past, so reading text from them is not always reliable. If you can open the PDF normally and select the text to copy/paste, then hopefully you should see it in a PDF-to-text tool; if not, it's likely vector. Opening it in a free PDF editor, or Illustrator if you have it, will give you an idea of how the stamp is made up.
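
For a quick look at how the stamp is made up without a PDF editor, here is a minimal Java sketch (Java being PDFBox's native language; it uses only the standard library, not PDFBox). It rests on one fact about the format: stamps and text boxes added in a viewer are normally stored as annotations, with `/Subtype /Stamp` and `/Subtype /FreeText` respectively, rather than in the page content, which may be why a page-text extraction does not return them. The class name `AnnotationSubtypeScan` and the method `countSubtypes` are made up for illustration. The scan is crude: it only sees dictionaries that are stored uncompressed, so an empty result does not prove the annotations are absent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: a rough diagnostic only.
public class AnnotationSubtypeScan {

    // Annotation subtype names from the PDF specification that matter here.
    private static final Pattern SUBTYPE =
            Pattern.compile("/Subtype\\s*/(Stamp|FreeText|Watermark|Widget)\\b");

    /** Counts each subtype name found in the raw (uncompressed) bytes of the file. */
    static Map<String, Integer> countSubtypes(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = SUBTYPE.matcher(raw);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(countSubtypes(bytes));
    }
}
```

On Java 11 or later this can be run directly with `java AnnotationSubtypeScan.java` followed by the path to the PDF. If it reports Stamp or FreeText entries, the text is more likely to be reachable through the page's annotations than through the page content.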
 
Associate
OP
Joined
27 Jan 2005
Posts
1,308
Location
S. Yorks
Pho,

Thanks for that, I'll take another look.

Dup,

Opening the PDF, I can right-click the text box and it says it is a text box, but I cannot read the text from it with the reader. It's not an AcroForm either, as that shows as null.


Matt
 