Spider files

Introduction
When we talk about processing information, everything starts with getting the input data. This usually comes down to exploring some directory (or repository) containing different sets of information and then “doing something” with it. Those repositories tend to be massive, so the most effective approach is to automate the process. We don’t need superpowers to achieve this; a few pointers are enough to make the goal reachable.
As an example of this process, I’ll explain how to recover all the content from a tree of folders containing PDF files in a very simple way. To sort this out, I’ll divide the problem into three main issues:
- Getting the text content of a single PDF file.
- Getting the paths of all the folders and sub-folders.
- Creating a loop so we retrieve the text from every PDF file.
Process
1. Processing the PDFs
One of the most common file formats we receive is the Adobe PDF (Portable Document Format). This format was created with the intent of being independent of application software, hardware and operating system, by storing not only the text and graphics but also all the information about the layout and fonts. There are multiple readers available, such as Adobe Acrobat Reader, Evince or Foxit Reader.
Of course, not everything is so pretty: some Adobe PDF files contain XFDF (XML Forms Data Format) and are therefore only rendered properly by proprietary Adobe programs for now. I have faith that the open source community will eventually solve this issue, as it defeats the purpose of the format.
I would also like to point out that, while PDF may be a standard for documents that will be printed out, it is not “screen-friendly”: PDF files cannot adapt properly to e-readers, tablets and smartphones, since they are unable to reflow the content to fit the screen size. My advice is that, if you are publishing a document, you may want to consider the EPUB format for the digital edition.
Every document processing application starts with getting some source data, and in this case we are going to use the Apache PDFBox package, an open source Java library which provides several features, such as:
- Create new PDF documents.
- Extract the contents from existing documents.
- Manipulate a given document.
- Digitally sign PDF files.
- Print a PDF file using the standard Java printing API.
- Save PDFs as image files, such as PNG or JPEG.
- Validate PDF files against the PDF/A-1b standard.
- Encrypt/decrypt a PDF document.
In this example I am only going to work with plain text, as this is an excerpt from a program where I intended to make the text indexable in order to create search relationships between different documents, so bear in mind that PDFBox can do much more than what we will see here.
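As a quick taste of those extra features, here is a minimal sketch of the “save as image” one, assuming the PDFBox 2.x API; the class and method names (PdfToPng, firstPageToPng) are placeholders of my own:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToPng {

    // Renders the first page of a PDF as a 300 DPI PNG file
    public static void firstPageToPng(File pdf, File png) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFRenderer renderer = new PDFRenderer(document);
            BufferedImage image = renderer.renderImageWithDPI(0, 300); // page index 0
            ImageIO.write(image, "png", png);
        }
    }
}
```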
So let’s get down to business: if we are using Maven, the very first step is adding the required dependency to the pom.xml file, so we can get the library.
❕This was the stable version when the post was originally written.
```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- check Maven Central for the latest stable release -->
    <version>2.0.4</version>
</dependency>
```
Now we can write a very short and simple snippet that reads the text content of a PDF file and stores it in a String, so we can do the heavy processing work afterwards, such as using Lucene to index said content and create some search functions that improve access to the information.
```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SpiderPdf {

    // Step 1: reads the whole text layer of a PDF file into a String
    public static String readPdf(File file) throws IOException {
        // PDFBox 2.x API: load the document and strip its plain text
        try (PDDocument document = PDDocument.load(file)) {
            return new PDFTextStripper().getText(document);
        }
    }
}
```
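And as a glimpse of that follow-up work, here is a minimal indexing sketch, assuming Lucene is available on the classpath; the index directory (“index”), the field names (“path”, “content”) and the class name PdfIndexer are arbitrary choices of mine:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PdfIndexer {

    // Adds one extracted PDF text to a Lucene index on disk
    public static void indexText(String path, String text) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            Document doc = new Document();
            doc.add(new StringField("path", path, Field.Store.YES)); // exact value, stored
            doc.add(new TextField("content", text, Field.Store.NO)); // tokenized for searching
            writer.addDocument(doc);
        }
    }
}
```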
2. The Spider and its web
When we go through a folder or directory, we may find not only files but also sub-folders, which may in turn contain more files or more sub-directories, and so on. As a consequence, we need a way to traverse this whole hierarchical structure, which calls for a recursive function. This idea is the core of the “Spider” which will crawl the “web” of files:
```
+ Directory_1
  - File_1.pdf
  - File_2.pdf
  + Directory_2
    - File_3.pdf
    + Directory_3
      - File_4.pdf
```
The “Spider” will detect all the files (File_1.pdf, File_2.pdf, File_3.pdf and File_4.pdf) thanks to a recursive structure, instead of getting stuck with only the first level of the tree (File_1.pdf and File_2.pdf).
This can be summarized in the following algorithm structure:
```
1.- Initialize the loop for each element in the current folder
2.- If the element is a file, add it to the list of results
3.- If the element is a folder, repeat the process inside it (recursive call)
4.- Return the list with all the files found
```
We can achieve this in Java by relying only on the java.io and java.util libraries, which are included in every Java Development Kit (JDK).
```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Spider {

    // Step 2: walks the directory tree recursively and collects every PDF file
    public static List<File> crawl(File directory) {
        List<File> pdfs = new ArrayList<>();
        File[] elements = directory.listFiles();
        if (elements == null) {
            return pdfs; // not a directory, or it could not be read
        }
        for (File element : elements) {
            if (element.isDirectory()) {
                // recursive call: the spider goes one level deeper
                pdfs.addAll(crawl(element));
            } else if (element.getName().toLowerCase().endsWith(".pdf")) {
                pdfs.add(element);
            }
        }
        return pdfs;
    }
}
```
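As a side note, from Java 7 onwards the same traversal can also be written with the java.nio.file API; here is a minimal sketch under that assumption (the class and method names NioSpider and crawlNio are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NioSpider {

    // Same idea as Spider.crawl, but letting Files.walk do the recursion
    public static List<Path> crawlNio(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .collect(Collectors.toList());
        }
    }
}
```

Both versions return the same list of files; the explicit recursive one simply makes the “Spider” idea easier to see.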
3. Loop using the methods from steps 1 and 2
Finally we get to the easiest part: we just need some basic Java iteration to finish our epic SpiderPdf.java. We get all the file paths with the method from the second step, and process each of them by invoking the code from the first step.
```java
// excerpt from MapFolderOfPdfs.java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MapFolderOfPdfs {

    public static void main(String[] args) throws IOException {
        // Step 3: map every PDF path found by the Spider to its extracted text
        Map<String, String> contents = new HashMap<>();
        for (File pdf : Spider.crawl(new File(args[0]))) {
            contents.put(pdf.getAbsolutePath(), SpiderPdf.readPdf(pdf));
        }
        // 'contents' is now ready for the heavy processing (indexing, searching...)
    }
}
```
❗️ I would recommend working with iterators when dealing with collections: you may decide to change the underlying structure in the future for one that optimizes access or storage time, and this way you will not have to rewrite that piece of code. A HashMap is probably one of the best choices for accessing information we may want to classify, but it will not necessarily give the best time for storing the content. If you get to work with an increasing amount of information, you may consider a TreeMap.
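To illustrate that advice, here is a minimal sketch where the iteration only depends on the Map interface, so swapping HashMap for TreeMap is a one-line change (the class name and the sample entries are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapSwapExample {

    public static void main(String[] args) {
        // Only this line changes when swapping implementations:
        Map<String, String> contents = new HashMap<>(); // or: new TreeMap<>();

        contents.put("/docs/File_1.pdf", "text of file 1");
        contents.put("/docs/File_2.pdf", "text of file 2");

        // Iteration written against the Map interface keeps working
        // no matter which concrete implementation backs 'contents'.
        for (Map.Entry<String, String> entry : contents.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```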