Key Considerations in Big Data Application Testing
2016 is emerging as the year of Big Data. Organizations that leverage big data are likely to surge ahead, while those that do not risk falling behind. According to the Viewpoint Report, “76% (of organizations) are planning to increase or maintain their investment in Big Data over 2 – 3 years”. Data emerging from social networks, mobile devices, CRM records, purchase histories and similar sources gives companies valuable insights and uncovers hidden patterns that can help enterprises chart their growth story. When we talk about this data, we are talking about huge volumes measured in petabytes, exabytes and sometimes even zettabytes. Along with this huge volume, the data originates from different sources and needs to be processed at a speed that keeps it relevant to the organization. To make this enterprise data useful, it has to be presented to users via applications.
As with all other applications, testing forms an important part of Big Data applications as well. However, testing Big Data applications has more to do with verifying the data than with testing individual features. When it comes to testing a Big Data application, there are a few hurdles that we need to cross.
Since data is fetched from different sources, it needs live integration to be useful. This can be achieved by end-to-end testing of the data sources to ensure that the data used is clean, that data sampling and data cataloging techniques are correct, and that the application does not have a scalability problem. Along with this, the application has to be tested thoroughly to facilitate live deployment.
For a tester, the most important aspect of a Big Data application thus becomes the data itself. When testing Big Data applications, the tester needs to dig into unstructured or semi-structured data with a changing schema. These applications also cannot be tested via ‘Sampling’, as data warehouse applications can. Since Big Data applications contain very large data sets, testing has to be done with the help of research and development. So how does a tester go about testing Big Data applications?
To begin with, testing Big Data applications requires testers to verify large volumes of data by employing the clustering method. The data can be processed interactively, in real time or in batches. Checking data quality also becomes critically important, covering accuracy, duplication, validity, consistency, completeness etc. We can broadly divide Big Data application testing into three basic categories:
- Data Validation:
Data Validation, also known as pre-Hadoop testing, ensures that the right data is collected from the right sources. Once this is done, the data is pushed into the Hadoop system and tallied with the source data to ensure that the two match and that the data has been pushed into the right location.
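As a rough illustration of tallying loaded data against its source, the following Python sketch compares two small in-memory record sets by count and by per-record checksum. The function names, the record layout and the sample rows are all assumptions made for this example; a real pipeline would read from the actual source systems and from Hadoop rather than from Python lists.

```python
import hashlib
from collections import Counter

def record_digest(record):
    """Return a stable SHA-256 digest for one record (a tuple of fields)."""
    # Join fields with a separator that is unlikely to occur in the data.
    joined = "\x1f".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def compare_datasets(source_rows, loaded_rows):
    """Tally source records against loaded records, ignoring row order."""
    source = Counter(record_digest(r) for r in source_rows)
    loaded = Counter(record_digest(r) for r in loaded_rows)
    return {
        "source_count": sum(source.values()),
        "loaded_count": sum(loaded.values()),
        # Records present in the source but absent (or altered) after loading.
        "missing": sum((source - loaded).values()),
        # Records found after loading that have no match in the source.
        "unexpected": sum((loaded - source).values()),
    }

# Hypothetical sample data: the third record was altered during loading.
source_rows = [(1, "alice", 30.0), (2, "bob", 12.5), (3, "carol", 7.25)]
loaded_rows = [(2, "bob", 12.5), (1, "alice", 30.0), (3, "carol", 7.20)]
report = compare_datasets(source_rows, loaded_rows)
```

Matching counts alone would not catch the altered record here, which is why the sketch also compares checksums.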
- Business Logic validation:
Business logic validation is the validation of “MapReduce”, which is the heart of Hadoop. During this validation, the tester has to verify the business logic on every node and then verify it against multiple nodes. This is done to ensure that the MapReduce process works correctly, that data segregation and aggregation rules are correctly implemented, and that key-value pairs are generated correctly.
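One way to picture this check is to run the same map and reduce logic over a single partition and over several partitions, then compare both results with an independently computed expectation. The Python sketch below simulates this locally. It does not use Hadoop, and the mapper, the reducer and the "region,amount" input format are hypothetical stand-ins for real business logic.

```python
from collections import defaultdict

def map_record(line):
    """Hypothetical mapper: emit a (region, amount) pair from 'region,amount'."""
    region, amount = line.split(",")
    return region.strip(), float(amount)

def reduce_pairs(pairs):
    """Hypothetical reducer: sum the amounts for each region key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(partitions):
    """Map and reduce each partition on its own, then merge the partial results."""
    partials = [reduce_pairs(map_record(line) for line in part) for part in partitions]
    merged = [(key, value) for partial in partials for key, value in partial.items()]
    return reduce_pairs(merged)

lines = ["north,10.0", "south,5.5", "north,2.5", "east,1.0"]
expected = {"north": 12.5, "south": 5.5, "east": 1.0}

single_partition = run_job([lines])             # all data in one partition
multi_partition = run_job([lines[:2], lines[2:]])  # data split across two partitions

assert single_partition == expected
assert multi_partition == expected
```

If the aggregation rule were not safe to apply to partial results (an average, for example), the multi-partition run would disagree with the expectation, which is the kind of defect this stage is meant to expose.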
- Output validation:
This is the final stage of Big Data testing, where the output data files are generated and then moved to the required system or the data warehouse. Here the tester checks the data integrity, ensures that data is loaded successfully into the target system, and confirms that there is no data corruption by comparing the HDFS data with the target data.
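The comparison between HDFS output and target data can be sketched as a keyed diff. In the Python example below, the tab-separated text stands in for an HDFS output file and the comma-separated text for an export from the target system. Both formats, the function names and the sample values are assumptions for illustration only.

```python
import csv
import io

def load_keyed(text, delimiter):
    """Parse delimited text into {key: remaining fields}, keyed on the first column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {row[0]: row[1:] for row in reader if row}

def diff_output(hdfs_text, target_text):
    """Report keys that are missing, extra, or different in the target."""
    hdfs = load_keyed(hdfs_text, "\t")
    target = load_keyed(target_text, ",")
    shared = hdfs.keys() & target.keys()
    return {
        "missing_in_target": sorted(hdfs.keys() - target.keys()),
        "extra_in_target": sorted(target.keys() - hdfs.keys()),
        # Fields are compared as strings, so "1.0" and "1.00" count as different.
        "mismatched": sorted(key for key in shared if hdfs[key] != target[key]),
    }

# Hypothetical sample: "south" was never loaded and "north" was corrupted.
hdfs_text = "east\t1.0\nnorth\t12.5\nsouth\t5.5\n"
target_text = "east,1.0\nnorth,12.0\n"
result = diff_output(hdfs_text, target_text)
```

Comparing fields as strings keeps the sketch simple; a real check would likely normalize types and formats before comparing.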
Architecture Testing forms a crucial part of Big Data Testing as a poor architecture will lead to poor performance. Also, since Hadoop is extremely resource intensive and processes large volumes of data, architectural testing becomes essential. Along with this, since Big Data applications involve a lot of shifting of data, Performance Testing assumes an even more important role in identifying:
- Memory utilization
- Job completion time
- Data throughput
When it comes to Performance Testing, the tester has to take a very structured approach, as it involves testing huge volumes of structured and unstructured data. The tester has to identify the rate at which the system consumes data from different data sources and the speed at which the MapReduce jobs or queries are executed. Along with this, the testers also have to check sub-component performance, that is, how each individual component performs in isolation.
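The three metrics listed above (memory utilization, job completion time and data throughput) can be captured for a single component in isolation with a small wrapper. The Python sketch below uses the standard library's `time.perf_counter` and `tracemalloc`. The `measure` helper and the `sample_job` workload are hypothetical, and measuring a real cluster would rely on the cluster's own monitoring rather than on in-process timing.

```python
import time
import tracemalloc

def measure(job, records):
    """Run job(records) and report completion time, throughput and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = job(records)
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes since start().
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result,
        "seconds": elapsed,
        "records_per_second": len(records) / elapsed if elapsed > 0 else float("inf"),
        "peak_bytes": peak_bytes,
    }

def sample_job(records):
    """Hypothetical component under test: total the length of every record."""
    return sum(len(record) for record in records)

records = ["row-%d" % i for i in range(100_000)]
metrics = measure(sample_job, records)
print("%.4f s, %.0f records/s, %d bytes peak"
      % (metrics["seconds"], metrics["records_per_second"], metrics["peak_bytes"]))
```

Note that `tracemalloc` only tracks memory allocated by Python itself, and tracing adds overhead, so the timings from such a run are indicative rather than exact.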
Performance testing a Big Data application requires testers to take a defined approach that begins with:
- Setting up of the application cluster that needs to be tested.
- Identifying and designing the corresponding workloads.
- Preparing individual custom scripts.
- Executing the test and analyzing the results.
- Re-configuring and re-testing components that did not perform optimally.
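The last two steps, analyzing results and selecting components to re-test, can be sketched as a comparison of measured times against per-workload limits. In the Python example below, the `Workload` class, the component names and every threshold and timing are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    component: str
    max_seconds: float  # hypothetical acceptable completion time

def find_underperformers(workloads, measured_seconds):
    """Return the components whose measured time exceeds the workload's limit."""
    return [
        workload.component
        for workload in workloads
        if measured_seconds[workload.component] > workload.max_seconds
    ]

# Hypothetical workloads and measured completion times, in seconds.
workloads = [
    Workload("ingest", 60.0),
    Workload("mapreduce", 300.0),
    Workload("export", 45.0),
]
measured = {"ingest": 42.0, "mapreduce": 415.0, "export": 44.9}

# Components in this list would be re-configured and re-tested.
retest = find_underperformers(workloads, measured)
```

Keeping the limits next to the workload definitions makes it easier to repeat the same comparison after each re-configuration.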
Since the testers are dealing with very large data sets that originate from hyper-distributed environments, they need to verify all this data quickly. To enable that, testers need to automate their testing efforts. However, most automation testing tools are not yet capable of handling unexpected problems that could arise during the testing cycle, and no single tool can perform end-to-end testing. Automating Big Data application testing therefore requires technical expertise, great testing skills, and knowledge.
Big Data applications hold much promise in today’s dynamic business environment. But to realize their benefits, testers have to employ the right test strategies, improve testing quality and identify defects in the early stages to deliver not only on application quality but on cost as well.