When we talk about processing information, it all starts with getting the input data. This usually comes down to exploring some directory (or repository) containing different sets of information, and then “doing something” with it. Those repositories tend to be massive, so the most effective approach is to automate the process. We don’t need superpowers to achieve this: a few pointers are enough to make the goal reachable.
As an example of this process, I’ll explain how to recover all the content from a tree of folders containing PDF files in a very simple way. To sort this out, I’ll divide the problem into three main issues:
Getting the file content of a single PDF file.
Getting all the folders and sub-folders paths.
Creating a loop so we retrieve the text from all the PDF files.
Process
1. Processing the PDFs
One of the usual file formats we get is the Adobe Acrobat PDF (Portable Document Format). This format was created with the intent of being independent from application software, hardware and operating system, by storing not only the text and graphics, but the whole information about the layout and fonts. There are multiple readers, such as the Adobe Acrobat Reader, Evince or Foxit Reader.
Of course, not everything is so pretty, as some Adobe PDF files contain XFDF (XML Forms Data Format) and are therefore only properly rendered in proprietary Adobe programs for now. I have faith in the open source community eventually solving this issue, as it defeats the purpose of the format.
I would also like to point out that, while PDFs may be a standard for documents which will be printed out, they are not “screen-friendly”: they cannot be read comfortably on e-readers, tablets and smartphones, as they are not able to adjust the content size to the screen. My advice is that if you are publishing a document, you may want to consider the EPUB format for the digital edition.
Every single document processing application starts with getting some source data, and in this case we are going to use the Apache PDFBox package, an open source library which provides several features, such as:
Create new PDF documents.
Extract the contents from existing documents.
Manipulate a given document.
Digitally sign PDF files.
Print a PDF file using the standard Java printing API.
Save PDFs as image files, such as PNG or JPEG.
Validate PDF files against the PDF/A-1b standard.
Encrypt/decrypt a PDF document.
In this example I am only going to work with plain text, as this is an excerpt from a program where I intended to make the text indexable in order to create search relationships between different documents; bear in mind that PDFBox can do much more than that.
So let’s get down to business: if we are using Maven, the very first step is adding the required dependency to the pom.xml file, so we can get the library.
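The dependency declaration looks like this (the version number is an assumption; pick the current stable release):

```xml
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.4</version>
</dependency>
```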
❕This was the stable version when the post was originally written.
Now we can work on a very short and simple snippet to read the text content from a PDF file and store it in a String, so we are able to do the heavy processing work afterwards, such as using Lucene to index said content and create search functions that improve access to the information.
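A minimal sketch of such a snippet, assuming the PDFBox 2.x API (the class and method names here are illustrative, not the original post’s exact code):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfReader {

    /**
     * Extracts the plain text of a PDF file into a String.
     */
    public static String extractTextFromPdf(String path) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File(path))) {
            return new PDFTextStripper().getText(document);
        }
    }
}
```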
2. Getting all the folders and sub-folders paths

When we go through a folder or directory, we may find not only files but other sub-folders, which may in turn contain more files or more subdirectories, and so on. The consequence is that we need a way to go through this whole hierarchical structure: a recursive function. This idea is the core of the “Spider” which will crawl the “web” of files:
The “Spider” will detect all the files (File_1.pdf, File_2.pdf, File_3.pdf and File_4.pdf) thanks to a recursive structure, instead of getting stuck with only the first level of the tree (File_1.pdf and File_2.pdf).
This can be summarized in the following algorithm structure:
1. Initialize the while loop for each element
2. Is it a base case?
   - Yes: solve the base case
   - No: execute the recursive function
3. End the while loop
We can achieve this in Java by relying only on the java.io and java.util libraries, which are included in every Java Development Kit (JDK).
/*
 * Lists the files only on that level
 */
public List<String> listPaths(String path) {
    File f = new File(path);
    List<String> l = new LinkedList<String>();
    if (f.exists()) {
        File[] fileArray = f.listFiles();
        for (int i = 0; i < fileArray.length; i++) {
            l.add(fileArray[i].getAbsolutePath());
        }
    } else {
        System.err.println("The path " + path + " is incorrect");
    }
    return l;
}
/*
 * Also lists the sub-directories content
 */
public List<String> listPathsRecursive(String path) {
    File f = new File(path);
    List<String> l = new LinkedList<String>();
    if (f.exists()) {
        File[] fileArray = f.listFiles();
        for (int i = 0; i < fileArray.length; i++) {
            // check the sub-directories
            if (fileArray[i].isDirectory()) {
                List<String> l1 = listPathsRecursive(fileArray[i].getAbsolutePath());
                l.addAll(l1);
            } else {
                // isValidFormat will check the file extensions
                // e.g. fileNameString.endsWith(".pdf")
                if (ClasificadorDeFicheros.isValidFormat(fileArray[i].getAbsolutePath())) {
                    l.add(fileArray[i].getAbsolutePath());
                }
            }
        }
    } else {
        System.err.println("The path " + path + " is incorrect");
    }
    return l;
}
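As a side note, on Java 8+ the same recursive traversal can be written with the java.nio.file API, which walks the tree for us. This is a sketch of the alternative (not the original post’s code), filtering by the .pdf extension as isValidFormat does above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfPathLister {

    /**
     * Recursively lists all .pdf files under root, like listPathsRecursive,
     * but letting Files.walk handle the recursion.
     */
    public static List<String> listPdfPaths(Path root) throws IOException {
        // the stream must be closed, so we use try-with-resources
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .map(Path::toString)
                       .filter(name -> name.endsWith(".pdf"))
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String path : listPdfPaths(root)) {
            System.out.println(path);
        }
    }
}
```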
3. Loop using the methods from steps 1 and 2
Finally we get to the easiest part: we just need some basic Java iteration to finish our epic SpiderPdf.java. We get all the file paths with the method from the second step, and process each one by invoking the code from the first step.
// excerpt from MapFolderOfPdfs.java
Iterator<String> it = spider.listPathsRecursive(mainFolderPath).iterator();
Map<String, String> mapContent = new HashMap<String, String>();
while (it.hasNext()) {
    String currentPath = it.next();
    mapContent.put(currentPath, PdfReader.extractTextFromPdf(currentPath));
}
❗️ I would recommend working with iterators when dealing with collections: you may later change the underlying structure to one that optimizes access or storage time, and you will not have to rewrite that piece of code. A HashMap is probably one of the best choices for accessing information we want to classify, although it is not the fastest at storing content. If you get to work with an increasing amount of information, you may consider a TreeMap.
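A small sketch of that advice (names are illustrative): by coding against the Map interface, swapping HashMap for TreeMap is a one-line change at the call site.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ContentMapDemo {

    /**
     * Fills the path-to-content map; callers depend only on the Map
     * interface, so the concrete implementation can be swapped freely.
     */
    public static Map<String, String> buildIndex(Map<String, String> target,
                                                 List<String> paths) {
        for (String path : paths) {
            target.put(path, "text of " + path); // placeholder for extracted text
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/docs/b.pdf", "/docs/a.pdf");
        // fast lookups, no ordering guarantees:
        Map<String, String> hash = buildIndex(new HashMap<String, String>(), paths);
        // keys kept sorted, O(log n) operations:
        Map<String, String> tree = buildIndex(new TreeMap<String, String>(), paths);
        System.out.println(hash.get("/docs/a.pdf"));
        System.out.println(tree.keySet()); // sorted: [/docs/a.pdf, /docs/b.pdf]
    }
}
```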
Most programmers know about the concept of unit tests and have dabbled with the JUnit framework while learning the beauty of automated tests. However, few have met Mockito, which is in fact one of the best frameworks for unit testing.
We may use a layered structure on a server to split the elements according to their functionalities, and following that train of thought we can modularize the code in logical layers. That’s where Mockito comes in.
By using a system of mocks we can substitute a whole dependent component class with a behavioural emulation, following the behaviour-driven development paradigm. The best way to explain this is with an example, so let’s suppose we have an interface called IDocumentAccessDao, implemented by the class DocumentAccessDao. This data access object performs some database accesses using Jdbc, and while we intend to create tests covering all of its instructions, it makes no sense to actually connect to the database: it may not be available and make our tests fail (and that would also be an integration test, not a unit test).
Process: how do we tackle this?
1. Setting up the Maven dependencies
The first step is getting the testing dependencies into our project, and that’s something we can do via Maven by adding them to the pom.xml file.
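The dependency declarations look like this (the version numbers are an assumption; pick the current stable releases):

```xml
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-core</artifactId>
  <version>1.10.19</version>
  <scope>test</scope>
</dependency>
```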
❕These were the stable versions when the post was originally written
❗️If we are using a component which may also be used in other classes (e.g. JDBC or JPA implementations to handle the connections to databases), it would be good to apply inheritance to those components, as they are highly reusable.
Let’s start by creating the test class, which we will call DocumentAccessDAOTest; don’t forget that if you are using Spring, you may want to load the mocks from the context.xml file.
We can see that it uses calls to DocumentDAO and JdbcTemplate methods, so we need to mock those calls to avoid running code from other classes. Therefore, we will use the following three attributes in our DocumentAccessDAOTest class:
documentDAO: the entity we will test.
jdbcTemplate: the mock for the database connection.
documentDAOMock: we intend to execute only the code in DocumentAccessDAO, so we will simulate the rest by returning default dummy values for every method invoked on this object.
The code in the initMock method will follow this structure:
Initialize the mocks: we need to know the results expected from the different calls to mocked objects. The syntax for these methods looks like initMockForMethod(inputParameters, resultExpected), and will be detailed later.
Call the method we want to test.
Check that the result obtained is the one we expected by using assert instructions. If we expect an exception, we should use the “expected” annotation attribute.
@TransactionAttribute(TransactionAttributeType.REQUIRED)
@Interceptors(SpringBeanAutowiringInterceptor.class)
@ContextConfiguration(locations = "/appDao-testConfiguration.xml")
@RunWith(SpringJUnit4ClassRunner.class)
public class DocumentAccessDAOTest {

    // all the mocks will be injected into this instance
    @InjectMocks
    private DocumentAccessDAO documentDAO;

    // initialize the mocks via annotation
    @Mock
    private JdbcTemplate jdbcTemplate;

    @Mock
    private DocumentAccessDAO documentDAOMock;

    @Before
    public void initMock() {
        // initialize generic behaviour
        Mockito.when(jdbcTemplate.queryForList(Matchers.anyString()))
                .thenReturn(createResultList());
        Mockito.when(documentDAOMock.getLastDocumentVersion(Matchers.anyString()))
                .thenReturn(1);
        Mockito.when(documentDAOMock.getDocumentVersion(Matchers.anyString(),
                Matchers.anyString(), Matchers.anyString()))
                .thenReturn(createDummyDocument());
    }

    @Test
    public void getCollaborateDocumentStatusReturnsValidResultExpected()
            throws GenericDAOException {
        // method to call
        List<Document> listDocument = documentDAO.getCollaborateDocumentStatus();
        // check result
        Assert.assertTrue(!listDocument.isEmpty()
                && listDocument.size() == 4
                && listDocument.get(0).getDocId().equals(MockedDocumentValues.MY_DOCUMENT_ID)
                && listDocument.get(1) == null);
    }

    // the exception is managed through the annotation
    @Test(expected = GenericDAOException.class)
    public void getCollaborateDocumentStatusThrowsException()
            throws GenericDAOException {
        // initialise non-generic mocked methods for this test
        Mockito.when(jdbcTemplate.queryForList(Matchers.anyString()))
                .thenThrow(new RecoverableDataAccessException(
                        MockedValues.GENERIC_DAO_EXCEPTION_TEXT));
        // method to call
        documentDAO.getCollaborateDocumentStatus();
    }

    private Document createDummyDocument() {
        Document document = new Document();
        document.setVersion(1);
        document.setDocId(MockedDocumentValues.MY_DOCUMENT_ID);
        return document;
    }
}
As you can see, we use the class MockedDocumentValues.java to generate the dummy values for some parameters. This class belongs to a set of common classes named Mocked*Values in the JUnit auxiliary project, created to avoid duplicated values among the test cases.
Cookbook for more complex cases
I’ll do a quick syntax enumeration of all the odd situations I found:
When we want to return a value independently of the parameters’ content, we can use Matchers.any(Object.class) (where Object may be any custom class we prefer; if we use one of the classical Java types, we can use their own methods: anyString(), anyInt(), anyList()…).
If we want to do something similar, mixing parameters whose content we don’t mind with other values that are important, we should combine Matchers.any(Object.class) and Matchers.eq(instance).
Another useful method is Matchers.isA(class). When we have a series of em.persist(object) calls and we have to find the one we actually need, we can single it out by pointing at the class of the instance it receives.
4. Mocking a procedure with possible input/output parameters (persist method)
Sometimes, we have to check the new primary key of an object after it has been inserted into the database via entityManager.persist(instanceObject). When this happens, we have to mock the method to simulate the answer received, as in this example.
/**
 * Mocks the update done when persisting a LegalDoc in the database
 */
public void initMockForPersistLegalDoc() {
    Mockito.doAnswer(new AssignIdToLegalDocAnswer(LEGAL_ID))
            .when(em).persist(Matchers.any(LegalDoc.class));
}
Another complex example using the doAnswer method: by defining an answer “on the fly” we can change not only the output or return statement, but also define input/output parameters.
public void initMockForMyProcedure(MyInputOutputObject object1) {
    Mockito.doAnswer(new Answer<MyCustomOutputObject>() {
        @Override
        public MyCustomOutputObject answer(final InvocationOnMock invocation)
                throws Throwable {
            // the original argument may be changed, only for this function
            final MyInputOutputObject originalArgument =
                    (MyInputOutputObject) (invocation.getArguments())[0];
            // we define the output parameter value here
            final MyCustomOutputObject returnedValue = new MyCustomOutputObject();
            returnedValue.setValueOutput(new MyCustomOutput());
            return returnedValue;
        }
    }).when(myService).myProcedure(Matchers.any(MyInputOutputObject.class));
}
5. Mocking a JPA query-response method as a single method
This avoids problems when several pairs of “named query”/“getResults” calls are used in a single method, so the results of each one don’t get mixed.
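A minimal sketch of the idea (the query name, entity and helper name are assumptions, not the original post’s code): each named query and its result are mocked together in one init method, so every query/result pair stays isolated.

```java
/**
 * Hypothetical helper: mocks one named query and its result as a unit.
 */
public void initMockForFindAllDocuments(List<Document> expectedResult) {
    // a dedicated Query mock per named query keeps results from mixing
    Query queryMock = Mockito.mock(Query.class);
    Mockito.when(em.createNamedQuery("Document.findAll")).thenReturn(queryMock);
    Mockito.when(queryMock.getResultList()).thenReturn(expectedResult);
}
```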
8. Mocking an ‘Object…’ parameter
These are called varargs parameters. E.g. to mock something like JdbcTemplate.queryForList(String sql, Object… args), we need to use Matchers.<ClassName>anyVararg().
In case you need to mock legacy code containing any of these issues, you should use PowerMock, taking into account that not all releases of Mockito are fully compatible with PowerMock.
❗️ When there is an evil static class in your application and you can’t get rid of it without breaking half your application, you may consider using a singleton pattern to work around the issue.
public final class SendMailHelper {

    // this way we make sure we have only one instance
    private static SendMailHelper instance;

    // no one external can create new instances
    private SendMailHelper() {
    }

    // we control the instance creation here
    public static SendMailHelper getInstance() {
        if (instance == null) {
            instance = new SendMailHelper();
        }
        return instance;
    }

    // just in case we need to set a mock
    public void setMailHelper(SendMailHelper helper) {
        instance = helper;
    }
}
Then, in the classes that used to call SendMailHelper.method(), we add an attribute declaration, and when needed we can set it for the tests (in the initMock() method).
1. Test only one code unit at a time
When we test a unit, it may have multiple use cases. We should always test each use case in a separate test case. For example, if we write tests for a function which takes two parameters and should return a value after doing some processing, the different use cases might be:
First parameter can be null. It should throw an InvalidParameterException.
Second parameter can be null. It should throw an InvalidParameterException.
Both can be null. It should throw an InvalidParameterException.
Finally, test the valid output of function. It should return valid predetermined output.
This helps when you make code changes or refactor: running the test cases should be enough to check that functionality is not broken. Also, if you change any behaviour, you will need to change some test cases.
2. Make each test independent of all the others
Don’t create a chain of unit test cases. It will prevent you from identifying the root cause of a test failure, and you will have to spend time debugging the code. It also creates dependency: if you have to change one test case, you need to make changes in multiple test cases unnecessarily.
3. Mock out all external services
Otherwise, the behaviour of those external services overlaps multiple tests, and different unit tests can influence each other’s results.
We have to be sure each test resets the relevant statics to a known state before running. We have to avoid dependencies between tests and systems, so running them in a different order won’t affect the outcome.
4. Name your unit tests clearly and consistently
This is the most important point to keep in mind. We must name our test cases according to what they actually do and test. A naming convention which builds test case names from class and method names is never a good idea: every time you change a method or class name, you end up updating a lot of test cases as well.
But if our test case names are logical, i.e. based on operations, then you will need almost no modification, because the application logic will most probably remain the same.
E.g. test case names should look like this (supposing EmployeeTest is our JUnit class):
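Some illustrative names (my own examples, assuming a hypothetical EmployeeTest class; the point is that each name describes the behaviour under test, not the method or class being tested):

```java
// inside the hypothetical EmployeeTest class
@Test
public void salaryIsCalculatedForFullWorkingWeek() { /* ... */ }

@Test
public void salaryCalculationThrowsExceptionWhenEmployeeIsNull() { /* ... */ }

@Test
public void overtimeRateIsAppliedOutsideNormalWorkingHours() { /* ... */ }
```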
5. Aim for each unit test method to perform exactly one assertion
We should try to test only one thing per test case, so use one single assertion per test case. This way, if a test case fails, you know exactly what went wrong.
6. Create unit tests that target exceptions
If some of your test cases expect exceptions to be thrown from the application, use the “expected” attribute. Avoid catching the exception in a catch block and using fail or assert methods to conclude the test.
7. Do not print anything out in unit tests
If you are correctly following all the guidelines, you will never need to add any print statement to your test cases. If you feel you need one, revisit your test case(s).
8. Extend from generic classes to avoid rewriting code
Use generic abstract classes (e.g. JdbTemplateDAO and JpaDAO) as much as you can when you are mocking database connections.
9. Check that the mocked values you are going to create don’t already exist
When mocking values related to the most used entities, check that they don’t already exist in the auxiliary classes.
10. Create a JUnit suite when testing classes which implement more than one interface
Our test design is interface oriented, so when a class implements more than one interface, we can create suites to see the coverage more easily, using the @Suite.SuiteClasses annotation as in this example.
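A sketch of such a suite (the test class names here are assumptions based on the earlier examples):

```java
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// groups the tests of a class implementing more than one interface,
// so their combined coverage can be inspected in one run
@RunWith(Suite.class)
@Suite.SuiteClasses({
        DocumentAccessDAOTest.class,
        DocumentSearchDAOTest.class
})
public class DocumentDAOTestSuite {
}
```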
The Open Systems Interconnection model (OSI model) is a conceptual model that characterises and standardises the communication functions of a telecommunication or computing system without regard to its underlying internal structure and technology.
| Number | Name | Protocol data unit (PDU) |
|--------|------|--------------------------|
| 7 | Application Layer | Data |
| 6 | Presentation Layer | Data |
| 5 | Session Layer | Data |
| 4 | Transport Layer | Segment, Datagram |
| 3 | Network Layer | Packet |
| 2 | Data Link Layer | Frame |
| 1 | Physical Layer | Symbol |
Abstract Layers
Layer 1: Physical Layer
unstructured raw data between a device and a physical transmission medium
converts the digital bits into electrical, radio, or optical signals
Layer 2: Data Link Layer
provides node-to-node data transfer
detects and possibly corrects errors that may occur in the physical layer
defines the protocol to establish and terminate a connection between 2 physically connected devices
defines the protocol for flow control between them
Layer 3: Network Layer
provides the functional and procedural means of transferring variable length data sequences (called packets) from one node to another connected in “different networks”
Layer 4: Transport Layer
provides the functional and procedural means of transferring variable-length data sequences from a source to a destination host, while maintaining the quality of service functions.
Layer 5: Session Layer
controls the dialogues (connections) between computers
establishes, manages and terminates the connections between the local and remote application
Layer 6: Presentation Layer
establishes context between application-layer entities, in which the application-layer entities may use different syntax and semantics if the presentation service provides a mapping between them
Layer 7: Application Layer
layer closest to the end user
interacts with software applications that implement a communicating component
I have always loved coding: it is a world based on maths and physics, where logic allows you to understand causes and consequences. I already had several software geeks among my favourite characters in the comic books I read as a teenager. Of course, they were secondary or tertiary characters, like X-Men’s Sage or Planetary’s The Drummer. Their main skill was the ability to understand how things work, so they were able to see things that were hidden in plain sight for the rest of the world. What I could not guess was that in the future, by studying computer science, I would also feel as if I had that kind of superpower.
The so-called computer revolution started over the 90s, but nowadays computer science is tangled in our personal lives through “Smart Technology”: we have SmartPhones, SmartBands, SmartGlasses, SmartWatches… The Quantified Self, or measuring our daily information (pulse, calories, expenses, web positioning), leads to a huge amount of information about any one of us. The big question is: do you know how that data is handled? That is the skill you get by learning how to code: you may read the source and discover what it does and how it does it. The best thing is that you will be able to make informed and responsible decisions; the worst is that your friends may look at you as if you were some weirdo when they ask why you do not use a certain app and you answer: “I have seen the code… AND IT IS HORRIBLE!”.
Therefore, you may be able to edit that code and improve it. You may adapt the system to suit your taste in design, or make it more efficient and safe, like preventing it from sharing some of your data which you may not want anyone else to have. You may even repair it when you have some problems. It is great to be able to create new systems which handle a lot of information for you, so you do not have to spend a lot of time in tedious work. You may be able to spend your time on other matters or manage more information with less effort. This train of thought will lead you to love the free/libre software and open source movements, which will provide you access to the code for free. Some days you may feel like a Shadowrun character in its endless fight against megacorps, but it is worth it.
So, why is learning how to code so interesting?
Sometimes you may access the source, audit it, or even play with it. In other cases, software patents may prevent you from learning how it works, and that may lead to all kinds of absurd situations. One example is “smart home appliances”. There are washing machines that connect to your home wifi to send a notification to your phone saying that their program is about to finish. There are also fridges which detect that you do not have enough eggs and may ask the nearest grocery to send a dozen to your home. The need for these systems may be arguable: my favourite ridiculous “smart” object is a food dispenser for cats which opens its recipient when you send a tweet mentioning its Twitter account, and yes, that thing existed in 2011.
Now imagine that the system is poorly programmed and has “Russian roulette code” like this:
int min = 1;
int max = 6;
int value = (int) (Math.random() * (max - min + 1)) + min;
if (value == min) {
    die();
} else {
    // your code goes here
}
This is a very simple case which simulates throwing a 6-sided die. Computers do not have the ability to generate truly random values, so they usually rely on sources like the latest digits of the internal clock to generate a pseudo-random result. Hence, if you get a number between 2 and 6, everything will work as expected, but if you get a 1 the system will crash and die. This may lead, in a silly way, to planned obsolescence in its worst definition: going back to the smart fridge I mentioned before, it may keep buying tomatoes until judgement day, which may really hurt your wallet. Fixing it would be as simple as removing the whole if clause surrounding the “your code goes here” line, making it more efficient (you would be running seven lines of code less) and more secure, but if this were proprietary code you would not be able to (or rather, you should not) do it.
This kind of situation is a setback, or sometimes part of a twisted business model that damages your right to repair. That is the reason why I will always prefer a system which lets me see what is happening under the hood. People can review it and improve it, making it better for everyone. This does not mean that creators should lose all control over their work: there are different licenses, similar to Creative Commons, which can be properly adapted to each business model, allowing you to handle the different rights. You always keep the authorship and moral rights, and you may (or may not) restrict economic remuneration and derivative works by choosing the model which fits you better. Making a modification does not mean doing something ill-intentioned; it may solve a dangerous issue and avoid many future problems.