Extracting text from PDF with extractFileText is not working for some PDFs
I am using the extractFileText function to extract text from PDF files, but for some files the function returns an empty string.
Using the pdfinfo function, I found that the PDF files from which extractFileText cannot extract text have a different Producer tag than those for which it works. In particular, extractFileText seems to fail when the tag is Producer: "iText 2.1.7 by 1T3XT".
No error message is generated; you simply get an empty string.
Can anyone help me? Thank you!
7 Comments
the cyclist
on 17 Oct 2023
Please post an example of a file or two that works, and a file or two that does not work.
iText 2.1.7 is from a product that is many years past its EOL, so a PDF created by it is probably not compliant in some fashion with what current PDF readers expect.
The <recent question about retrieving pdf bookmarks> had just sent me on a search for how to do that, and I came across iText along the way. There appears to have been a pretty big brouhaha some time ago over revisions, use of older versions, and licensing and IP infringement issues; a link to a discussion, with a response from an original author that outlines some of that, is <here>. So the question might be how old these files are and where they came from. Whatever the answers, I'd suspect the chances of MATLAB/TMW changing anything in their PDF-reading code to get around the issue are nil.
But, as @the cyclist says, attach one of them so somebody here could at least poke at it...who knows, maybe even a TMW staffer with inside knowledge just might stumble over it.
mario
on 18 Oct 2023
dpb
on 18 Oct 2023
"So I think the problem is really in iText 2.1.7."
Yes, I was virtually positive of that.
You can try sending a support/bug request to MathWorks at <support.mathworks.com> describing the issue and attaching a sample failing PDF. Do NOT expect MathWorks to go to an external site to fetch it. Similarly, attach the file here; very few people, if any, will go to an external site. (Although I don't think it will do much good for anybody here to try, somebody might have other tools or a full-blown Adobe install to experiment with.)
mario
on 18 Oct 2023
pdfinfo('cs 2023.01.03.pdf')
extractFileText('cs 2023.01.03.pdf')
This probably confirms the same symptoms you get locally. I also commented to a TMW staff member who had responded to another question about reading PDF files, to make them aware of this issue if they come back, on the presumption that they might have a specific interest in the MATLAB PDF file functions.
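For anyone who wants to check a whole folder rather than a single file, here is a minimal sketch built on the same two functions used in this thread. It assumes that pdfinfo returns a struct with a Producer field (as the original question implies) and that extractFileText returns an empty string for the affected files. The folder path and variable names are placeholders, not anything from the original poster's setup.

```matlab
% Sketch only: report the Producer tag and the amount of extracted text
% for every PDF in a folder, to see whether empty results line up with
% a particular producer.
pdfFolder = pwd;                                  % hypothetical folder; change as needed
files = dir(fullfile(pdfFolder, '*.pdf'));

for k = 1:numel(files)
    filename = fullfile(files(k).folder, files(k).name);

    % Read the document metadata; Producer is assumed to be a field of the result.
    info = pdfinfo(filename);
    producer = "<no Producer field>";
    if isfield(info, 'Producer') && ~isempty(info.Producer)
        producer = string(info.Producer);
    end

    % Extract the text and count characters; 0 means nothing was extracted.
    txt = extractFileText(filename);
    nChars = strlength(txt);

    fprintf('%s | Producer: %s | characters extracted: %d\n', ...
        files(k).name, producer, nChars);
end
```

If every file reporting 0 characters shares the same Producer value, that is a useful detail to include in a support request along with one sample file.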
mario
on 19 Oct 2023
Answers (1)
Christopher Creutzig
on 11 Dec 2023
This is a known issue in Text Analytics Toolbox. Please watch https://www.mathworks.com/support/bugreports/3155425 for updates.