Extracting text from PDF with extractFileText is not working for some PDFs

I am using the extractFileText function to extract text from PDF files but with some files the function returns an empty string.
Using the pdfinfo function, I found that the PDF files from which extractFileText cannot extract text have a different Producer tag than those for which it works. In particular, extractFileText seems to fail when the Producer tag is "iText 2.1.7 by 1T3XT".
No error message is generated; you simply get an empty string.
Can anyone help me? Thank you!
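
To check whether the empty results really track the Producer tag, the comparison described above can be run over a whole folder. This is only a minimal sketch: it assumes Text Analytics Toolbox is installed (for pdfinfo and extractFileText), and pdfFolder and the variable names are hypothetical placeholders, not anything from the original post.

```matlab
% Survey every PDF in a folder: record the Producer tag and how many
% characters extractFileText returns, to see whether empty results
% line up with one particular producer.
% Assumption: Text Analytics Toolbox is installed.
pdfFolder = pwd;                                 % hypothetical folder; adjust as needed
files = dir(fullfile(pdfFolder, "*.pdf"));

n = numel(files);
fileName = strings(n, 1);
producer = strings(n, 1);
numChars = zeros(n, 1);

for k = 1:n
    fullName = fullfile(files(k).folder, files(k).name);
    info = pdfinfo(fullName);                    % metadata, including Producer
    fileName(k) = files(k).name;
    producer(k) = info.Producer;
    numChars(k) = strlength(extractFileText(fullName));
end

results = table(fileName, producer, numChars);
results.noText = results.numChars == 0;          % true where extraction came back empty
disp(results)
```

Sorting the table with sortrows(results, "producer") makes it easy to see whether all the noText rows share one producer.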

7 Comments

Please post an example of a file or two that works, and a file or two that does not work.
iText 2.1.7 is a version well over a decade old and long past its EOL, so a PDF created by it is probably not compliant in some fashion with what current PDF readers expect.
The <recent question about retrieving pdf bookmarks> had just sent me on a search for how to do that, and I came across iText along the way. There appears to have been a pretty big brouhaha some time ago over revisions, the use of older versions, and licensing and IP infringement issues; a discussion answered by an original author that outlines some of it is <here>. So the question might be how old these files are and where they came from. Whatever the answers, I suspect the chances of MATLAB/TMW changing anything in their PDF-reading functions to get around the issue are nil.
But, as @the cyclist says, attach one of them so somebody here could at least poke at it...who knows, maybe even a TMW staffer with inside knowledge just might stumble over it.
Thanks for the responses and suggestions.
I noticed that if I open the file with a pdf viewer and then print the file by selecting the "Microsoft Print to PDF" option, it then becomes possible to extract the text via the extractFileText function. The producer tag of the new pdf file is: "Microsoft: Print To PDF."
So I think the problem is really in iText 2.1.7.
The pdf files in question are recent local Italian newspapers; you can download an example here:
https://www.dropbox.com/scl/fi/zjhekigzoi6h0fsmje4gr/cs-2023.01.21.pdf?rlkey=4g19mrhbfhhh50nffzrshc6tn&dl=0
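
The manual re-printing step could in principle be scripted by rewriting the file with an external tool and retrying the extraction. The sketch below uses Ghostscript's pdfwrite device as a stand-in for "Microsoft Print to PDF". That substitution is an assumption: nothing in this thread confirms that a Ghostscript rewrite fixes these particular files. It also assumes Ghostscript is installed and on the system path as "gs" (typically "gswin64c" on 64-bit Windows); the output file name is hypothetical.

```matlab
% Rewrite a PDF with Ghostscript, then retry extractFileText on the copy.
% Assumptions: Ghostscript is installed and callable as "gs"
% ("gswin64c" on 64-bit Windows); the output file name is hypothetical.
inFile  = "cs-2023.01.21.pdf";
outFile = "cs-2023.01.21-rewritten.pdf";
gsExe   = "gs";

% -o sets the output file; -sDEVICE=pdfwrite regenerates the PDF.
cmd = sprintf('%s -o "%s" -sDEVICE=pdfwrite "%s"', gsExe, outFile, inFile);
[status, msg] = system(cmd);

if status ~= 0
    warning("Ghostscript call failed: %s", msg);
else
    txt = extractFileText(outFile);
    fprintf("Extracted %d characters from the rewritten file.\n", strlength(txt));
end
```

If the rewritten copy still yields an empty string, the problem is more likely in how the text is encoded in the file than in the Producer as such.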
"So I think the problem is really in iText 2.1.7."
Yes, I was virtually positive of that.
You can try sending a support/bug request to MathWorks at <support.mathworks.com> describing the issue and attaching a sample failing PDF. Do NOT expect MathWorks to go to an external site. Similarly, attach the file here; very few people, if any, will follow a link to an external site. (I don't think it'll do much good for anybody here to try, but somebody might have other tools or a full-blown Adobe install they could experiment with.)
Thank you for the response.
I apologize for including an external link.
I am attaching here a PDF file that presents the problem described, in case anyone would like to test it.
Thank you.
pdfinfo('cs 2023.01.03.pdf')
ans = struct with fields:
    NumPages: 40
    PageSize: [40×4 double]
    PDFVersion: "1.4"
    Title: ""
    Subject: ""
    Language: ""
    Keywords: ""
    Author: ""
    Creator: ""
    Producer: "iText 2.1.7 by 1T3XT"
    CreationDate: 03-Jan-2023 03:17:20
    ModificationDate: 03-Jan-2023 03:17:20
    Encrypted: 0
    AllowsTextExtraction: 1
    Filename: "/users/mss.system.asxxnt/cs 2023.01.03.pdf"
extractFileText('cs 2023.01.03.pdf')
ans = ""
This probably confirms the same symptoms you get locally. I also mentioned the issue to a TMW staff member who responded to another question about reading PDF files, on the presumption that they might have a specific interest in MATLAB's PDF file functions, in case they come back to that thread.

Release: R2023a

Asked: 17 Oct 2023
