You Broke My Tests!
How can I reduce this problem?
Context:
10 scrum teams with devs and testers in each, in the same ART, doing SAFe. All contributing to a monolith ecom site. Weekly global prod deploy. Testers from those 10 dev teams maintain a suite of ~2500 Selenium checks that run daily and mostly work (i.e., they find bugs and have a false positive rate of about 5%).
Problem: The false positive rate fluctuates +/- 5% because changes to the product-under-test by Team_A often break Selenium checks owned by Team_B and Team_C. Fixing the broken checks is unplanned work.
It sounds as though work is not being continually integrated and tested. The leap of faith being taken is seemingly at least a day and up to a week. Where are the teams at regarding BDD/TDD and automated regression tests, run not according to a schedule, but triggered upon a change to the codebase?
Are these tests run by the developers before integrating their work? If you have to wait for the daily run, you could end up integrating the work from multiple developers before finding out whether the work has "broken" a test. And once a test is "broken", you then need to determine why: whether the test wasn't updated to account for the changes, whether it is highlighting a newly introduced defect, or whether it is a false positive. Getting this feedback earlier and closer to the work, such as on a development branch before merging it, helps the team understand the issue and get it corrected. The team should strive to have no broken tests in the integration branches.
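As a minimal sketch of what "run before integrating" can look like, assuming a JUnit 5 suite (the class, tag names, and Maven command are illustrative, not the poster's actual setup): tag the cheap tests so a branch build can gate on them, e.g. with mvn test -Dgroups=fast via Surefire, while everything else stays in the scheduled full run.

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical example: tags split the suite so a pre-merge build can run
// only the fast group (e.g., mvn test -Dgroups=fast with Surefire) while
// the nightly pipeline runs everything.
class OrderTotalTest {

    // Minimal stand-in for real business logic under test.
    static int orderTotal(int unitPriceCents, int quantity) {
        return unitPriceCents * quantity;
    }

    @Test
    @Tag("fast") // cheap, safe to gate every merge on
    void multipliesPriceByQuantity() {
        assertEquals(1000, orderTotal(500, 2));
    }

    @Test
    @Tag("slow") // expensive, left to the scheduled full run
    void placeholderForAnEndToEndCheck() {
        // a full end-to-end flow would live here, outside the pre-merge gate
    }
}
```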
Why are only Testers managing the tests? It's not clear how big your teams are or how many are dedicated to managing the tests, but sharing the workload can be beneficial. Having test case development and maintenance as part of the Definition of Done is useful. This also goes to the earlier running of test cases prior to integration.
Why do teams own tests? It seems like the teams don't have ownership of the product. Perhaps they are feature teams. This is often a less-than-optimal way of working. Teams may have more in-depth knowledge in certain aspects of the system, but when ownership introduces dependencies and gates value delivery, that introduces problems.
Wonderful questions/ideas so far.
The ART is broken into value stream teams (e.g., one team focuses on Checkout, one on Browsing). The co-ownership thing didn't really work. We tried it. There are some gray areas and co-owned tests, but I think it has worked better for teams to know which tests they own, because they are better able to estimate the test maintenance workload resulting from the product-under-test (PUT) changes.
Why aren't we running full Selenium regression checks prior to each merge-to-mainline? Scale. These are slow tests. 2500 take four hours running on all 40 available concurrent test threads. With 15+ product code merges-to-mainline per day, it is impossible; each 4-hour test suite run would queue up behind the active test suite run. Do the math: 15 runs at 4 hours each is 60 hours of suite time per day, against the 24 hours a day holds. Even if we could afford to 15x our 40 CPU test threads, we would hit an environment bottleneck. Selenium tests require an integrated build/deploy to a performant prod-like environment, and we only have one such env. Of course, we have other test automation (e.g., unit tests) that does run as a pre-merge task.
Shortening the feedback loop: So rather than running 2500 tests while a Story is being coded, testers run ~50 tests, focusing on the blast radius. The 2500 tests run nightly on the latest integrated code from each day. Results are checked in the AM and teams react.
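For what it's worth, one way to make such a blast-radius subset repeatable, assuming the Selenium checks are JUnit 5 tests tagged by value stream and the junit-platform-suite engine is on the classpath (the package and tag names below are hypothetical):

```java
import org.junit.platform.suite.api.IncludeTags;
import org.junit.platform.suite.api.SelectPackages;
import org.junit.platform.suite.api.Suite;

// Hypothetical "blast radius" suite: when a story touches checkout, run
// only the checkout-tagged Selenium checks instead of all 2500.
@Suite
@SelectPackages("com.example.uitests")
@IncludeTags("checkout")
class CheckoutBlastRadiusSuite {
}
```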
Yeah, I'm constantly waving the flag, suggesting product coders also help write Selenium test code. It happens. But it also drives down dev happiness, unfortunately.
Do the math.
Has the math really been done? Is it actually cheaper to incur test breakages, and to deal with their consequences, than to scale up automation environments and infrastructure? Improving technical performance first, even by orders of magnitude, is typically the lowest hanging fruit and the constraint to elevate.
Waiting for the full execution and then validation of the results is delaying the feedback received from the code changes made. How can you speed up that feedback loop? What do the Developers (in your case application coders and testers) suggest could help? Since they own the problem they should be part of the solution.
SAFe 6.0 has a System Team. One of the things that team is responsible for is assisting with end-to-end testing, which is what a Selenium test usually does. So involving that team in these discussions would be a good idea; they may be able to help find ways of improving the ability to identify which Selenium tests are impacted by code changes. Also investigate the use of Shared Resources in this situation instead of embedding all testers into individual teams. If some of them serve as Shared Resources, you might be able to identify issues sooner.
SAFe has a philosophy of Built-In Quality, and one of its concepts is Shift Learning Left. Is there a way that you can shift the learning that a code change could impact a specific test to the "left" of the process, instead of waiting until the latest possible moment to discover it?
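As a toy illustration of shifting that learning left (every path and tag below is hypothetical, and a real implementation would need a maintained ownership map): a small lookup from changed source paths to test areas could warn, at review time, which teams' Selenium subsets are likely affected.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Naive sketch (all mappings hypothetical): map changed source paths from
// a diff to the Selenium test tags that cover them, so owning teams learn
// about likely breakage at review time instead of the next morning.
public class TestImpactMapper {

    private static final Map<String, String> AREA_BY_PATH = new LinkedHashMap<>();
    static {
        AREA_BY_PATH.put("src/main/java/com/example/checkout", "checkout");
        AREA_BY_PATH.put("src/main/java/com/example/browse", "browsing");
    }

    public static Set<String> impactedTags(List<String> changedFiles) {
        Set<String> tags = new LinkedHashSet<>();
        for (String file : changedFiles) {
            AREA_BY_PATH.forEach((prefix, tag) -> {
                if (file.startsWith(prefix)) {
                    tags.add(tag);
                }
            });
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(impactedTags(List.of(
                "src/main/java/com/example/checkout/PaymentService.java")));
        // prints [checkout] -> run the checkout team's subset pre-merge
    }
}
```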
I have a long background in software QA. I started doing it before QA started doing automation. The rest of my answer is based upon that background.
Do the math.
You are doing it wrong. Sorry, but that is my opinion. Ever hear of the Testing Pyramid? What you have is 2500 tests that attempt to ensure everything works via the UI using Selenium. Those are the most expensive tests you can run and also the most fragile (as your situation illustrates). Has any thought been put into creating more tests at the unit and integration levels? Those can be run faster, cheaper, and with every check-in. They can validate that the application will work and be maintained by the developers as they write/update code. I suggest spending some time working on switching your methods to something that is cheaper and easier to maintain.
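To make that concrete with a hedged, hypothetical example (the coupon rule below stands in for whatever business logic the UI checks currently exercise): a rule that would need a multi-page Selenium flow can be verified at the unit level in milliseconds.

```java
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

// Hypothetical illustration of pushing a check down the pyramid: a coupon
// rule that would require a multi-page Selenium flow through the UI is
// verified directly against the business logic instead.
class CouponRuleTest {

    // Stand-in for the real rule living in the application code.
    static int applyCoupon(int totalCents, int percentOff) {
        if (percentOff < 0 || percentOff > 100) {
            throw new IllegalArgumentException("percentOff out of range");
        }
        return totalCents - (totalCents * percentOff / 100);
    }

    @Test
    void tenPercentOffIsApplied() {
        assertEquals(900, applyCoupon(1000, 10));
    }

    @Test
    void invalidPercentageIsRejected() {
        assertThrows(IllegalArgumentException.class, () -> applyCoupon(1000, 150));
    }
}
```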
A quick story from my experience ...
About 12 months ago we added a new column called "Check Automated Tests" to our ADO board. The idea was to assess the impact each PBI had on existing tests and modify them before they broke, and also highlight any new tests required. We expanded our DoD to include this activity.
It appeared to work well for a while. Automated tests were becoming part of everyone's day-to-day, not just the QAs'. There was a lot more consideration from everyone in maintaining them. However, more recently the QAs agreed it was actually less effort to fix broken tests after the fact than to be proactive. We've since removed the column from the board and adjusted our DoD.
However, more recently the QAs agreed it was actually less effort to fix broken tests after the fact than to be proactive.
This is an indicator that you are maintaining two code bases, and one is dependent on the other. In essence, you have two products: the application and the automation. That is why I suggested investing more in unit and integration tests that are lower level. They run faster, they can be maintained while the code is, and they can give a level of certainty that the application will still work. In the organizations where I have been involved in implementing these, the through-the-UI test counts were greatly reduced, and those tests only validated that the UI was working, without trying to validate that the entire application worked. If you are trying to do CI/CD you absolutely need this. And if you have multiple products dependent on others, you will find a lot of frustration relieved.
However, more recently the QAs agreed it was actually less effort to fix broken tests after the fact than to be proactive. We've since removed the column from the board and adjusted our DoD.
What this has done is turn a complex challenge (probe-sense-respond) into a chaotic one (act-sense-respond).
The ecom site is a monolith. The test architecture reflects this, and hence has been designed around expensive and top-heavy UI tests. Monolithic designs breed technical debt, which in turn breeds chaos. So it seems your choice now is to either:
- throw resources at automation, until red-green-refactor becomes timely and workable, or
- bite the bullet, and refactor the site architecture so lower-level unit and integration tests can be better supported
- or both
Apologies, I should have mentioned my comment related to automated Regression Testing of the UI. Unit and Integration tests are managed separately from this, with 80% code coverage.
"Why aren't we running full Selenium regression checks prior to each merge-to-mainline. Scale. These are slow tests. 2500 take four hours running on all 40 available concurrent test threads."
Sorry, but your test automation approach is wrong.
You are using GUI tests to verify business functionality; your test automators should use integration tests instead. These run a lot faster because they go through the code and bypass the GUI.
In Java, for example, such integration tests use the Spring MVC test framework with the JUnit Jupiter engine/API.
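A minimal sketch of that idea, assuming a Spring Boot application (the /cart endpoint and JSON shape are invented for illustration): MockMvc exercises the controller and everything behind it without a browser or a deployed environment.

```java
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

// Hedged sketch assuming a Spring Boot app (endpoint and JSON shape are
// hypothetical). MockMvc drives the controller layer directly, so the
// check runs in milliseconds with no browser and no deploy.
@SpringBootTest
@AutoConfigureMockMvc
class CartEndpointIT {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void emptyCartReturnsZeroTotal() throws Exception {
        mockMvc.perform(get("/cart/42"))
               .andExpect(status().isOk())
               .andExpect(jsonPath("$.totalCents").value(0));
    }
}
```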
GUI tests with Selenium should only check whether the GUI itself is broken: one happy path per screen, maybe a few full E2E flows, and a few triggered error messages. Keep GUI tests to a real minimum, because they take about as long to execute automated as a person would take manually.
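A hedged example of such a minimal happy-path check (the URL and locator are hypothetical): it only asserts that the screen renders and is wired up, leaving business rules to lower-level tests.

```java
import org.junit.jupiter.api.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Minimal happy-path smoke check in the spirit described above (URL and
// locator are hypothetical): it only asserts the screen renders and is
// wired up, leaving business-rule coverage to lower-level tests.
class CheckoutPageSmokeTest {

    @Test
    void checkoutPageRenders() {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // no visible browser needed
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://shop.example.com/checkout");
            assertTrue(driver.findElement(By.id("place-order")).isDisplayed());
        } finally {
            driver.quit();
        }
    }
}
```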
I have a QA background, but was at first a bit stumped for an answer, as the situation described is not uncommon. 2500 UI tests and a 4-hour run time are not that excessive on bigger projects. Automation has become the way teams guarantee product quality.
This question is crossing over into DevOps (points 5-7), but a few things:
1. As said above, write more NUnit tests and integration tests initiated from the business layer.
2. Look at duplication, and remove duplicated tests.
3. On the human side, people should investigate all failures on a submit, not only the tests in their own value stream.
4. Run the UI tests in headless mode (but I assume you're doing this already).
5. Investigate moving the tests into a high-performance cloud environment, which is cheaper than buying high-performance hardware, and scale that environment up.
6. Investigate running the tests in (multiple) containers in the cloud (say Azure), in headless mode; this can speed up the process considerably (see the sketch after this list).
7. Git source repositories allow tests to run on a pull request, before a submit. Normally a subset of tests is used, but you can require a designated set of tests to pass before code can be submitted.
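Putting points 4-6 together in a hedged sketch (the grid URL and site URL are placeholders, not a known setup): the same test code can target a Selenium Grid of headless browser containers, so capacity scales by adding nodes rather than buying bigger machines.

```java
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

// Hedged sketch for points 4-6 (the grid URL is hypothetical): the same
// test code targets a Selenium Grid of headless containers, so capacity
// scales by adding nodes instead of buying bigger test machines.
public class GridSessionExample {

    public static void main(String[] args) throws Exception {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // point 4: run headless

        WebDriver driver = new RemoteWebDriver(
                new URL("http://selenium-grid.internal:4444/wd/hub"), options);
        try {
            driver.get("https://shop.example.com/");
            System.out.println("Title: " + driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
```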