Spider files

Introduction

When we talk about processing information, it all starts with getting the input data. This usually comes down to exploring some directory (or repository) containing different sets of information, and then “doing something” with it. Those repositories tend to be massive, so the most effective approach is to automate the process. We don’t need superpowers to achieve this: a few pointers are enough to make this goal reachable.

As an example of this process, I’ll explain how to recover all the content from a tree of folders containing PDF files in a very simple way. To sort this out, I’ll divide the problem into three main parts:

  • Getting the file content of a single PDF file.
  • Getting the paths of all the folders and sub-folders.
  • Creating a loop so we retrieve the text from all the PDF files.

Process

1. Processing the PDFs

One of the usual file formats we get is the Adobe Acrobat PDF (Portable Document Format). This format was created with the intent of being independent of application software, hardware and operating systems, by storing not only the text and graphics, but all the information about the layout and fonts. There are multiple readers, such as Adobe Acrobat Reader, Evince or Foxit Reader.

Of course, not everything is so pretty: some Adobe PDF files contain XFDF (XML Forms Data Format), and for now those are only rendered properly in proprietary Adobe programs. I have faith in the open source community eventually solving this issue, as it defeats the purpose of the format.

I would also like to point out that, while PDF may be a standard for documents which will be printed out, it is not “screen-friendly”: PDFs cannot be read comfortably on e-books, tablets and smartphones, as they are not able to adjust the content to the screen size. My advice is that if you are publishing a document, you may want to consider the EPUB format for the digital edition.

Every document processing application starts with getting some source data, and in this case we are going to use the Apache PDFBox package, an open source library which provides several features, such as:

  • Create new PDF documents.
  • Extract the contents from existing documents.
  • Manipulate a given document.
  • Digitally sign PDF files.
  • Print a PDF file using the standard Java printing API.
  • Save PDFs as image files, such as PNG or JPEG.
  • Validate PDF files against the PDF/A-1b standard.
  • Encrypt/decrypt a PDF document.
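
As a quick taste of one of those features, here is a minimal sketch of how a single page could be exported as a PNG image with the 2.x rendering API (the file names are just placeholders):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImage {

    public static void main(String[] args) throws IOException {
        // Load the document and render its first page at 300 DPI
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            BufferedImage image = renderer.renderImageWithDPI(0, 300);
            ImageIO.write(image, "PNG", new File("example-page-1.png"));
        }
    }
}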

In this example I am only going to work with plain text, as this is an excerpt from a program where I intended to make the text indexable in order to create search relationships between different documents, so bear in mind that PDFBox can do so much more than that.

So let’s get down to business: the very first step, if we are using Maven, is adding the required dependency to the pom.xml file so we can get the library.

❕This was the stable version when the post was originally written.

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.6</version>
</dependency>

Now we can write a very short and simple snippet to read the text content from a PDF file and store it in a String, so we are able to do the heavy processing work afterwards, such as using Lucene to index that content and build some search functions that improve the access to information.

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfReader {

    public static String extractTextFromPdf(String path) throws IOException {
        System.out.println("Parsing a PDF");

        File f = new File(path);
        if (!f.isFile()) {
            System.err.println("The file " + path + " doesn't exist.");
            return null;
        }

        // Load the document and extract its plain text,
        // closing the document automatically when we are done
        try (PDDocument pdDoc = PDDocument.load(f)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            return pdfStripper.getText(pdDoc);
        }
    }

}
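
To give an idea of how this could be used on its own, here is a minimal sketch (the sample path is just a placeholder):

import java.io.IOException;

public class PdfReaderExample {

    public static void main(String[] args) throws IOException {
        // Extract the text of a single document and print a short preview
        String text = PdfReader.extractTextFromPdf("/tmp/sample.pdf");
        if (text != null) {
            System.out.println(text.substring(0, Math.min(200, text.length())));
        }
    }
}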

2. The Spider and its web

When we go through a folder or directory, we may find not only files, but also sub-folders, which may in turn contain more files or more sub-directories, and so on. The consequence is that we need a way to walk through this whole hierarchical structure, using a recursive function. This idea is the core of the “Spider”, which will crawl through the “web” of files:

+ Directory_1
|-- File_1.pdf
|-- File_2.pdf
|-+ Directory_2
|  |-- File_3.pdf
|  |-- File_4.pdf

The “Spider” will detect all the files (File_1.pdf, File_2.pdf, File_3.pdf and File_4.pdf) thanks to a recursive structure, instead of getting stuck with only the first level of the tree (File_1.pdf and File_2.pdf).

This can be summarized in the following algorithm structure:

1.- Loop over each element in the directory
2.- Is it a base case (a regular file)?
    – Yes: solve the base case (keep the file)
    – No: call the recursive function on the sub-directory
3.- End of the loop

We can achieve this in Java by relying only on the java.io and java.util packages, which are included in every Java Development Kit (JDK).

import java.io.File;
import java.util.LinkedList;
import java.util.List;

public class Spider {

    /*
     * Lists the files only on that level
     */
    public List<String> listPaths(String path) {
        File f = new File(path);

        List<String> l = new LinkedList<String>();
        if (f.exists()) {
            File[] fileArray = f.listFiles();
            for (int i = 0; i < fileArray.length; i++) {
                l.add(fileArray[i].getAbsolutePath());
            }
        } else {
            System.err.println("The path " + path + " is incorrect");
        }
        return l;
    }

    /*
     * Also lists the sub-directories content
     */
    public List<String> listPathsRecursive(String path) {
        File f = new File(path);
        List<String> l = new LinkedList<String>();
        if (f.exists()) {
            File[] fileArray = f.listFiles();
            for (int i = 0; i < fileArray.length; i++) {
                // check the sub-directories
                if (fileArray[i].isDirectory()) {
                    List<String> l1 = listPathsRecursive(
                            fileArray[i].getAbsolutePath());
                    l.addAll(l1);
                } else {
                    // isValidFormat checks the file extension
                    if (isValidFormat(fileArray[i].getAbsolutePath())) {
                        l.add(fileArray[i].getAbsolutePath());
                    }
                }
            }
        } else {
            System.err.println("The path " + path + " is incorrect");
        }
        return l;
    }

    /*
     * Checks the file extension, e.g. only keep PDF files
     */
    private boolean isValidFormat(String fileNameString) {
        return fileNameString.toLowerCase().endsWith(".pdf");
    }
}
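
Before wiring everything together, a quick way to try the Spider on its own could look like this (the folder path is just a placeholder):

public class SpiderExample {

    public static void main(String[] args) {
        Spider spider = new Spider();
        // Walk the whole tree and print every PDF path that was found
        for (String pdfPath : spider.listPathsRecursive("/tmp/documents")) {
            System.out.println(pdfPath);
        }
    }
}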

3. Loop using the methods from steps 1 and 2

Finally we get to the easiest part: we just need some basic Java iteration to finish our epic SpiderPdf.java. We get all the file paths with the method from the second step, and process them by invoking the code from the first step.

// excerpt from MapFolderOfPdfs.java
Iterator<String> it = spider.listPathsRecursive(mainFolderPath).iterator();
Map<String, String> mapContent = new HashMap<String, String>();
while (it.hasNext()) {
    String currentPath = it.next();
    mapContent.put(currentPath, PdfReader.extractTextFromPdf(currentPath));
}

❗️ I would recommend working with iterators when dealing with collections: you may want to change the underlying structure in the future to a new one which optimizes access or storage time, and this way you do not have to rewrite that piece of code. A HashMap is probably one of the best choices for accessing information we may want to classify, but it will not always give the best time for storing the content. If we get to work with an increasing amount of information, you may consider a TreeMap.
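
As a small sketch of that idea (the variable names are only illustrative), declaring the collection against the Map interface means the rest of the code does not care which implementation sits behind it:

// Declaring the variable against the Map interface ...
Map<String, String> mapContent = new HashMap<String, String>();
// ... means that switching to a sorted structure later only touches this line:
// Map<String, String> mapContent = new TreeMap<String, String>();

// The iteration below works the same with either implementation
for (Map.Entry<String, String> entry : mapContent.entrySet()) {
    System.out.println(entry.getKey() + " -> " + entry.getValue().length() + " characters");
}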