Quickly Search Strings inside PDF files

I have ~25,000 PDF files that I want to classify based on the presence of keywords in their text. I know there's a PDF Toolbox that provides MATLAB with an interface for reading PDF text, but the fact that it comes from SourceForge makes it difficult to obtain (this is for work), and the reliance on Java seems like it would make the process very slow, especially for searching so many files. Is there a simpler, faster way to parse these documents if all I want to do is basically strfind on the text to check for keywords?

7 Comments

Parsing PDFs is not at all trivial. Search this forum for the discussions on parsing PDFs to get an idea of why. In a nutshell, PDF is really a layout standard rather than a text-document standard, which is why it is possible to do things like this:
that LaTeX package displays text (e.g. an email address) in the correct order, but the characters are stored in the PDF in a scrambled order. Think about that for a minute. And now try to think how a PDF parser should be able to "know" the correct character order that you see on the page: any naive file parser will only read random text.
Summary: parsing PDFs is neither trivial nor fast.
The best solution is to avoid using PDFs to store text data, and use a more suitable format (TXT, XML, ODF, etc).
With respect sir, if parsing PDFs was trivial I wouldn't be asking the question. The text may not be stored in an intuitive sequential format but it must have some kind of position identifiers attached to the characters, otherwise it couldn't construct any coherent documents. I have searched this forum extensively for an alternative solution to the PDF Toolbox mentioned above, but haven't found one that addresses working with documents in bulk. That is the aim of this question. We would always like to have the data in a format other than what we're given. That isn't always an option.
I haven't used this open-source PDF Toolbox, but there's nothing wrong with using Java libraries in MATLAB. Java is not generally slower than other approaches, unless you can find a MEX library for parsing PDFs.
jgg
jgg on 11 Apr 2016
Edited: jgg on 11 Apr 2016
What Stephen is saying is that without a great deal of detailed information about the underlying structure of the PDF files you're not likely to have any success. Ben Litchfield's PDFBox Java library is the only really robust way to do this at present, and even that might not work. I don't think anyone else has a library that can even attempt the task you're looking at doing, let alone doing it quickly.
Stephen's suggestion is that you would probably have better success trying to get your data in a more amenable format than trying to apply another method to the PDF document. Apparently, some people have had success converting to Excel then reading in those values.
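Since MATLAB can call Java libraries directly, the PDFBox route does not require any glue code. A minimal sketch, assuming you have downloaded a PDFBox 2.x standalone jar (the jar path, file name, and keyword below are hypothetical):

```matlab
% Add the PDFBox jar to MATLAB's dynamic Java class path.
javaaddpath('C:\libs\pdfbox-app-2.0.0.jar');

% Load one PDF and extract its text via PDFBox's PDFTextStripper.
doc = org.apache.pdfbox.pdmodel.PDDocument.load(java.io.File('report.pdf'));
stripper = org.apache.pdfbox.text.PDFTextStripper();
txt = char(stripper.getText(doc));   % convert java.lang.String to char
doc.close();

% The keyword check is then an ordinary strfind.
hasKeyword = ~isempty(strfind(txt, 'keyword'));
```

Note that PDFBox 1.x keeps PDFTextStripper in org.apache.pdfbox.util instead of org.apache.pdfbox.text, so the class path above depends on which version you obtain.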
"Is there a simpler, faster way"
"if parsing PDFs was trivial I wouldn't be asking the question"
Dear Acme: There are a lot of ways in your catalog to avoid hurting myself when I walk off of cliffs, but they are cumbersome and take a lot of preparation. Is there a simpler, faster way I can avoid hurting myself when I do that?
Dear Coyote: Don't look down.
Michael B
Michael B on 18 Apr 2016
Edited: Michael B on 18 Apr 2016
@jgg: That's a good explanation. If you'd made this an official answer I'd accept it. Sounds like I'll have to figure out a clever way to get PDFBox on my work machine without breaking the rules...
@Walter Roberson: Thank you sir, that is tremendously helpful. You are truly a font of knowledge and support. I can only hope that one day I might possess a fraction of your wit.
For batch extracting I see the commercial product https://www.qoppa.com/files/pdfstudio/guide/batch-extract-text-from-pdf.htm (which I have never used.)
I also see instructions at https://kenbenoit.net/how-to-batch-convert-pdf-files-to-text/ for a free converter. As those instructions basically involve preparing a file of names and then running a shell script, building the file-name list inside MATLAB would not be difficult. Running the converter would be simple in macOS or Linux; in Windows it would take more work.
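The batch idea can also be driven from MATLAB itself rather than a shell script. A minimal sketch, assuming the xpdf pdftotext command-line tool is installed and on the system path (the folder name is hypothetical):

```matlab
% Build the file list in MATLAB and shell out to pdftotext for each PDF.
pdfDir = 'C:\pdfs';
d = dir(fullfile(pdfDir, '*.pdf'));
for k = 1:numel(d)
    pdfFile = fullfile(pdfDir, d(k).name);
    % pdftotext writes report.txt next to report.pdf by default.
    status = system(sprintf('pdftotext "%s"', pdfFile));
    if status ~= 0
        warning('Conversion failed for %s', pdfFile);
    end
end
```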


 Accepted Answer

Jan
Jan on 11 Apr 2016
PDFs are designed to guarantee identical output on different machines. You want to create a catalogue of the contained strings. These two goals do not match.
What about converting the PDFs with one of the many pdf-to-text tools and working on the text files instead? E.g. http://www.foolabs.com/xpdf, http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-NET
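Once the PDFs have been converted, the keyword classification itself is trivial. A minimal sketch, assuming the converted .txt files sit in one folder (folder name and keywords are hypothetical placeholders):

```matlab
keywords = {'budget', 'forecast'};              % hypothetical keywords
txtDir   = 'C:\pdftext';                        % hypothetical folder
txtFiles = dir(fullfile(txtDir, '*.txt'));
hits = false(numel(txtFiles), numel(keywords)); % one row per file
for k = 1:numel(txtFiles)
    txt = fileread(fullfile(txtDir, txtFiles(k).name));
    for j = 1:numel(keywords)
        hits(k, j) = ~isempty(strfind(txt, keywords{j}));
    end
end
```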

1 Comment

Since I have to parse 25,000 files, using an external converter really isn't a viable option unless it has batch capability. Alright, I guess there really isn't anything simpler than the PDFBox tool for a MATLAB interface. Thanks.


More Answers (1)

Have you tried the Text Analytics Toolbox?
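If that toolbox is available, its extractFileText function reads a PDF's text directly, after which the keyword check is a one-liner. A minimal sketch (the file name and keyword are hypothetical):

```matlab
str = extractFileText('report.pdf');   % returns a string scalar
found = contains(str, "keyword");
```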

1 Comment

Is there ANY way to effectively speed up textanalytics.internal.pdfparser.extractText?
A single page can take up to 20 seconds... I just want to extract a small section of text.
-Ben


Asked: 11 Apr 2016
Commented: 4 Jan 2023
