Automated Checks in JMeter
One of the fundamental questions in any business is how time is spent, yet systematically saving time at every single step of the software development process is often overlooked. This paper discusses how automated software checks provide valuable insight into the state of a product, reducing the time it takes to diagnose any issues or suspicious behavior. We will also discuss how these automated checks may evolve as the software project develops.
In a software project, Mean Time To Diagnose (MTTD), Mean Time To Fix (MTTF), and other metrics change dynamically as the codebase grows, processes change, corrective actions take place, teams change, etc. However, all other factors being equal, as the codebase grows (or, more precisely, as code complexity grows), the time to diagnose increases, often exponentially, and can in some cases be measured in days.
A software issue in this context is any unexpected behavior: it may be a configuration problem, a bug in processing, a data problem, an installation problem, a problem with disk space, a network problem, an external service outage, etc. When performing issue diagnosis, precious time can be wasted by steering the investigation in the wrong direction or by assigning resources to investigate multiple possible causes; in general, the investigation of a complex issue will usually start by eliminating misleading clues. Here we focus, in particular, on how to save diagnosis time by using automated checks to quickly collect pieces of information that help us eliminate those misleading clues.
How to Check the Status of a Web Service
Building a tool or infrastructure (usually around the Continuous Integration environment) for automated checks requires at least some customization and in-house development, simply because available tools do not support all project needs out of the box. This infrastructure should evolve together with the product, and over time the number of automated checks should grow. The best person to define these kinds of checks is an engineer with broad knowledge of the system design, who always starts from the question, “What kind of information does a certain check provide?” For example, if our product uses an external web service, we need to ensure the service is working properly.
There are several possible checks for this example, and using all of them together is advised (a minimal sketch of all four follows the list):
- (a) Check that the service is up and running (simply put: a ping). However, the service may be up and running but returning an error response, therefore:
- (b) Check that the service returns the expected response for the same request (one or more). Here, we put assertions on expected values, we define that we do not expect errors in the response, etc. While repeating the same test is good for preserving compatibility, issues may be hidden somewhere else, therefore:
- (c) Check the response for a valid request that uses one or more randomized values. These kinds of requests may not be comparable, depending on a product’s architecture, but they may catch some interesting bugs.
- (d) Finally, check the negative case(s): deliberately send invalid request(s) and configure response assertions to look for the specific error messages expected by design.
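To make the four checks concrete, here is a minimal sketch in plain Java (using the standard java.net.http client). The base URL, the /health and /users endpoints, and the asserted values are illustrative assumptions; in our setup the equivalent checks live in a JMeter test plan rather than standalone code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class WebServiceChecks {

    static final HttpClient CLIENT = HttpClient.newHttpClient();
    static final String BASE_URL = "https://service.example.com"; // assumed endpoint

    // (a) ping: the service answers at all
    static boolean checkUp() throws Exception {
        return send(BASE_URL + "/health").statusCode() == 200;
    }

    // (b) fixed request: response matches the expected, known-good values
    static boolean checkKnownRequest() throws Exception {
        HttpResponse<String> r = send(BASE_URL + "/users/42");     // stable test record
        return r.statusCode() == 200
                && r.body().contains("\"name\":\"Test User\"")     // expected value assertion
                && !r.body().contains("\"error\"");                // no error in response
    }

    // (c) randomized request: still valid, but not a fixed value
    static boolean checkRandomizedRequest() throws Exception {
        HttpResponse<String> r = send(BASE_URL + "/users?search=" + UUID.randomUUID());
        return r.statusCode() == 200 && !r.body().contains("\"error\"");
    }

    // (d) negative case: an invalid request must fail in the documented way
    static boolean checkNegativeCase() throws Exception {
        HttpResponse<String> r = send(BASE_URL + "/users/not-a-number");
        return r.statusCode() == 400 && r.body().contains("Invalid user id");
    }

    static HttpResponse<String> send(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```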
Checks that compare the values returned in the response against those stored in the database can also be added. Note that we did not explicitly state one important property of checks (b) and (c): whether they alter application data in any way, i.e. to GET something is not the same as to CREATE something. When checking the latter, the automated check is designed in the typical fashion (see the sketch after this list):
- setup (clean the record if it already exists)
- test (create)
- teardown (cleanup, if necessary)
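As a sketch of that setup/test/teardown pattern, the following standalone Java example probes a hypothetical /orders endpoint; the endpoint, payload, and record id are assumptions made for illustration only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateOrderCheck {

    static final HttpClient CLIENT = HttpClient.newHttpClient();
    static final String BASE_URL = "https://service.example.com"; // assumed endpoint
    static final String TEST_ID = "monitoring-order-001";         // dedicated monitoring record

    public static void main(String[] args) throws Exception {
        delete(TEST_ID);                                           // setup: clean leftovers

        HttpRequest create = HttpRequest.newBuilder(URI.create(BASE_URL + "/orders"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"id\":\"" + TEST_ID + "\",\"item\":\"probe\"}"))
                .build();
        HttpResponse<String> resp = CLIENT.send(create, HttpResponse.BodyHandlers.ofString());
        boolean pass = resp.statusCode() == 201 && resp.body().contains(TEST_ID); // test: create
        System.out.println("create check: " + (pass ? "PASS" : "FAIL"));

        delete(TEST_ID);                                           // teardown: leave no test data
    }

    static void delete(String id) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE_URL + "/orders/" + id))
                .DELETE().build();
        CLIENT.send(req, HttpResponse.BodyHandlers.ofString());    // ignore result: may not exist
    }
}
```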
It is also a good idea to add at least one check that will always fail or be “red”, just to make sure that the monitoring itself does not have any bugs that may produce invalid (overly optimistic) results.
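Such an always-failing “canary” can be as simple as an assertion that can never hold; the sketch below is purely illustrative.

```java
public class CanaryCheck {

    // If this check is ever reported green, the reporting/monitoring layer has a bug.
    static boolean alwaysFail() {
        return "expected".equals("deliberately-different"); // can never be true
    }

    public static void main(String[] args) {
        System.out.println("canary check: " + (alwaysFail() ? "PASS" : "FAIL")); // always FAIL
    }
}
```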
Interpreting the Results
This simple example illustrates that every automated check has to be designed to probe the service or component from a different angle, because different combinations of PASS/FAIL results lead us to the real cause of the problem. Here are some scenarios:
- if (a) is FAIL and any of the remaining checks is PASS, then something may be wrong with the monitoring script
- if (a) and (c) are PASS and (b) is FAIL, then something may be wrong with the service: it may be a data error, error between the application and the database, etc.
- if all except (d) are PASS, then something may be wrong with error handling
- if there are many (c) requests that alternately PASS and FAIL, then something may be wrong with the load balancer or one of the nodes behind it, etc.
The Setup
In our experience, we have successfully used Apache JMeter for automating checks like these. JMeter is a free, open source tool that supports many types of tests, including SOAP web services, databases, FTP, and HTTP, and it is fairly simple to add new custom components like the ones our team developed (for HBase, JMS, JSON, OAuth, SSH, comparing XMLs, etc.).
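As an illustration of what such a custom component can look like, here is a minimal sketch of a JMeter Java sampler built on the AbstractJavaSamplerClient extension point; the parameter names and the probe() placeholder are assumptions, not our actual HBase/JMS/SSH components.

```java
import org.apache.jmeter.config.Arguments;
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

public class CustomHealthCheckSampler extends AbstractJavaSamplerClient {

    @Override
    public Arguments getDefaultParameters() {
        Arguments args = new Arguments();
        args.addArgument("host", "service.example.com"); // assumed parameter names
        args.addArgument("expected", "OK");
        return args;
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            String expected = context.getParameter("expected");
            String actual = probe(context.getParameter("host")); // placeholder for the real check
            result.setSuccessful(expected.equals(actual));
            result.setResponseData(actual, "UTF-8");
        } catch (Exception e) {
            result.setSuccessful(false);
            result.setResponseMessage(e.getMessage());
        } finally {
            result.sampleEnd();
        }
        return result;
    }

    private String probe(String host) {
        // Hypothetical placeholder: a real component would open the SSH/JMS/HBase connection here.
        return "OK";
    }
}
```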
Once the checks and the corresponding [regex/xpath/schema/…] assertions are in place, the automated script is refactored to decouple the environment configuration and the input data (where needed) from the test. Decoupling environment configuration from the actual test is extremely important: it enables the team to re-use the same checks across different environments, cuts down on script maintenance time, and makes it easier to respond to environment changes, such as changes in service endpoints. A further step, where needed, is to decouple input data from the test. Input data may simply be moved from the script into a corresponding input file, e.g. a CSV file. Sometimes, for each row in the input file, one can define custom dynamic value assertions that can be read by the script (also supported by JMeter).
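The following standalone Java sketch illustrates the same idea of decoupling (JMeter’s CSV Data Set Config and property-based configuration cover it natively); the baseUrl property, the checks.csv file, and its column layout are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvDrivenChecks {

    static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Environment configuration kept outside the test: pass -DbaseUrl=... per environment.
    static final String BASE_URL = System.getProperty("baseUrl", "https://service.example.com");

    public static void main(String[] args) throws Exception {
        // checks.csv layout (assumed): userId,expectedName
        List<String> rows = Files.readAllLines(Path.of("checks.csv"));
        for (String row : rows) {
            if (row.isBlank()) continue;
            String[] cols = row.split(",");
            String userId = cols[0].trim();
            String expectedName = cols[1].trim();          // per-row dynamic assertion value
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create(BASE_URL + "/users/" + userId)).GET().build();
            HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
            boolean pass = resp.statusCode() == 200 && resp.body().contains(expectedName);
            System.out.println("user " + userId + ": " + (pass ? "PASS" : "FAIL"));
        }
    }
}
```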
When the script is ready, the next step is configuring it to run automatically at a given interval, for example, every 10 minutes; this of course depends on how long it takes for all the checks to complete. To speed up the execution, the script may be configured to execute multiple checks in parallel; in JMeter this is accomplished by turning off the “Run Thread Groups Consecutively” option and organizing different checks into corresponding Thread Groups. Automated re-running of the script is set up with an in-house runner, but with JMeter one can simply set up a cron job for this purpose, because JMeter supports running from the Linux command line.
The team should also decide how much historical data should be preserved. If test result artifacts (XML reports) consume significant space because the tests are executed frequently, older artifacts may be archived and compressed so that they remain available when a later root cause analysis is performed and the team is looking for when a certain problem first occurred. An additional improvement on the result reporting side enables result comparison from execution to execution and also preserves information about the builds against which the tests were executed.
Fault Tree Analysis and "Hints"
A final improvement that can also save some time when interpreting the results (and can make life easier for new team members) is providing hints based on the test results. In this paper, we have implicitly assumed that the result of a check is either PASS or FAIL, i.e. a boolean value. Therefore, a boolean decision table can be constructed and “hints” on the possible cause can be defined for each combination. These hints can be either verbal (to help humans understand possible causes) or defined in a way that triggers certain actions (like a service restart) in a self-healing system. Implementing this fault tree analysis model carries some risk: a bug in such a complex model can wipe out the time-saving benefits it has over a much simpler implementation.
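A minimal sketch of such a decision table in plain Java is shown below; the rows mirror the scenarios listed earlier and are examples only, not an exhaustive fault tree.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DiagnosisHints {

    // Decision table: the key is the PASS/FAIL pattern for checks (a),(b),(c),(d).
    static final Map<String, String> HINTS = new LinkedHashMap<>();
    static {
        HINTS.put("FAIL,PASS,PASS,PASS", "Suspect a bug in the monitoring script itself.");
        HINTS.put("PASS,FAIL,PASS,PASS", "Suspect a data error or an application/database issue.");
        HINTS.put("PASS,PASS,PASS,FAIL", "Suspect broken error handling.");
        HINTS.put("PASS,PASS,PASS,PASS", "All checks green; no action needed.");
    }

    static String hint(boolean a, boolean b, boolean c, boolean d) {
        String key = String.join(",", p(a), p(b), p(c), p(d));
        return HINTS.getOrDefault(key, "Unmapped combination (" + key + "): investigate manually.");
    }

    private static String p(boolean pass) { return pass ? "PASS" : "FAIL"; }

    public static void main(String[] args) {
        // Example lookup: ping OK, fixed request failing, randomized and negative checks OK.
        System.out.println(hint(true, false, true, true));
    }
}
```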
Conclusion
During the software development process, the time to diagnose an issue increases as project complexity grows. A combination of monitoring and automated testing is introduced in order to shorten the diagnosis time and to react quickly to any unexpected software behavior. In our described setup, we used JMeter, an open source testing tool, which enabled us to create various automated checks, to parameterize environment configuration, input data, and assertions, and to automatically re-run all checks at a given time interval. This type of setup will not only speed up the promotion of bug fixes to production; it will also enable bigger code changes to be made with confidence and decrease the number of reported issues that are not software bugs, but rather configuration or test environment problems.