At LY Corporation we're constantly working to improve our pre-release test process and reduce the risk of outages. One key piece of this process is our end-to-end testing tool, which connects to our messaging server in the same way as the LINE app, allowing us to test changes to the server with higher fidelity than unit or integration tests. Our end-to-end testing follows the principle of testing at every level of the hierarchy: we have far fewer test cases (a few hundred, compared to nearly ten thousand unit tests) covering a much narrower set of scenarios, but they allow us to confirm our system's behavior when making real network calls to external systems and storages. We've been adding to and improving this tool for several years, and we've recently implemented the ability to run it on individual development branches during the PR process, allowing us to detect issues earlier, iterate faster, and hopefully improve our test coverage even further.
Running this tool at the PR stage has actually been a goal for several years, but various improvements to our testing environment from multiple teams at LY Corporation meant it finally looked to be within reach when my team started this attempt earlier this year. Former-LINE components have traditionally been deployed on either bare-metal servers or individual VMs, managed with our internal ops tools, which meant that any new environment required manual provisioning. But our HBase team had recently released a standalone "HBase in a box" container that lets us easily spin up an isolated storage environment for local testing (which I'd just enhanced to automatically run our table creation scripts), and our Redis team wasn't far behind with a Kubernetes-based local test cluster configuration. We felt confident setting up a Kafka container ourselves, which meant that all of the main storages used by the messaging server could be made available in a containerized deployment.
We could also build on a major step towards full containerization from my own team: our "beta container" environments, which allow deploying a PR branch as a container on a shared Kubernetes cluster for testing, using our shared beta storage instance. This was a very valuable intermediate step, and meant we'd already debugged many of the issues associated with running our messaging server in containerized form - especially integrating with the many other services that need to call back into the messaging server, some of which needed to add new mechanisms or filters to route those callbacks to a specific test instance. (In the initial prototyping stages we even hoped to expand this approach to support our end-to-end testing tool, but ultimately we realized that fully isolated storage would be necessary if we were going to automatically run tests on every PR, as there's too much risk of one branch's code writing invalid data to shared storage and disrupting every other developer's work. With fewer than 20 PRs to this codebase in a typical day, the resource requirement for running on every PR isn't a huge increase - we were already making a comparable number of scheduled test runs - but isolation is a must.)

With the stage set, we were able to proceed with the project. Unlike a new code feature, there wasn't a large amount of code to be written; instead, there were a lot of fiddly edge cases to understand and track down. Often the result of several days of frustrating debugging would be just a single-line change, or even a note on our internal wiki explaining how to work around a problem. But we broke the problem down into manageable pieces, and slowly but surely built up to a working system. Here are some examples of the tasks we faced:
- Our end-to-end testing tool codebase had become somewhat convoluted as additional features were added, and it was written for an older 2.x version of Scala. Before making further changes to this codebase, we migrated it to an up-to-date 3.3 LTS version (a sketch of the kind of build change involved appears after this list), which in turn meant first updating and aligning various library dependencies. We also took the opportunity to do some general cleanup and refactoring.
- Since we were starting many interdependent containers automatically, reliable health checks became very important. A simple "is the port open" check proved insufficient, particularly for HBase, which starts listening on a port early in its startup process, before it's actually ready to serve requests. So we worked with our HBase team to add a more thorough health check to the "HBase in a box" container (and also to ensure the health check respected configurations such as non-default ports).
- Our HBase client is deliberately configured to limit the number of requests in flight, and to fail fast if too many requests are made rather than queueing them or falling back to some other behavior (see the conceptual sketch after this list). While this limit is important for avoiding excessive load on the HBase servers, the configuration designed for our (fast, but shared between many developers) alpha cluster was a little too strict; we adjusted it to fit a slower but isolated containerized configuration, where one server's load will never disrupt other developers' work.
- Some developers use ARM-based Macs as development machines, but some of our containers are only available for x86-64. Since we wanted to ensure that everyone could debug the test setup locally, we needed to figure out a VM-based setup that could run all the containers efficiently, and then document the steps on our internal wiki so that everyone could use a consistent configuration.
- Our pull request environment made Docker Compose V2 available, whereas our existing container configurations had generally been used with Docker Compose V1. Although we didn't encounter any concrete incompatibilities (that we know of), we decided to migrate to Docker Compose V2 everywhere for consistency, particularly to reduce the risk of mismatched behavior when debugging.
- Some of our Kafka client code assumed that we would always have a secondary Kafka cluster configured as a failover. Although we may eventually want to reproduce that setup in all our test environments, for now we modified the code to accept a single-cluster configuration.
- Our internal network firewalls (designed to prevent unwanted connections between development and production environments) initially blocked our containers from accessing some upstream services, since our build farm was classified as a production system. Understanding and fixing this was made more difficult by changes to our network topology happening in parallel, as part of the ongoing work of integrating former-Yahoo and former-LINE systems. Fortunately, colleagues from our Armeria team with experience of debugging low-level network issues were able to help out.
- The Kafka container was unable to resolve the Apache ZooKeeper container by hostname in some network configurations. While I would have liked to dive deeply into the details of this problem, ultimately we took the pragmatic approach of documenting a configuration that worked reliably and recommending it for all development work.
- Changes to our messaging server were constantly ongoing, and sometimes our fellow developers accidentally broke an end-to-end test without noticing (since our end-to-end tests currently only run as a scheduled task against our shared development branch). So we also needed to follow up on any end-to-end test breakages that happened during development, and of course confirm whether they represented false positives or "real" bugs being introduced into the messaging server. While we are continually making efforts to improve the reliability of our end-to-end tests, they can fundamentally never be as reliable as a traditional unit test, for a few reasons:
  - Parts of our system that we need to test are eventually consistent - sometimes our messaging server will send a task to a Kafka topic, and another component consumes that topic and writes data to our storage system based on those tasks. Even within a single component, we sometimes perform writes to storage (particularly HBase) asynchronously in order to avoid adding latency to our request processing. So when checking that a request to the messaging server has resulted in correct writes to storage, our end-to-end tester may need to retry with a backoff, and this can cause flakiness (a minimal sketch of this kind of retry appears after this list).
  - Operating end-to-end means test setup is a lot more complex and time-consuming. For example, when a user unregisters from LINE, cleaning up their data is a relatively heavy async process. In the real environment this is not a concern - unregistering is a relatively rare operation - and so it hasn't been a high priority for optimization. But most of our test cases create several test users and unregister them afterwards, so this ends up being a significant overhead.
  - While we now deploy containerized versions of our storage services for these end-to-end tests, there are other upstream services for which we still rely on shared test instances - often shared not just among our developers but with several other teams, not least the maintainers of those upstream services. So when there is an issue or failed deployment on that upstream team's "beta instance", it can disrupt our testing as well.
- We upgraded the JVM bytecode version for our end-to-end testing tool in parallel with this work, so we needed to align our testing pipeline with that change.
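
To give a concrete flavor of the Scala migration mentioned in the first item above, here is a minimal sketch of the kind of build definition change involved. The version numbers, compiler flags and dependencies are illustrative assumptions, not our actual build:

```scala
// build.sbt - illustrative sketch only; the versions, flags and dependencies
// here are assumptions, not our actual build definition.

// Move the end-to-end testing tool from Scala 2.x to a 3.3 LTS release.
// (During the port itself, the compiler's "-source:3.0-migration" mode can help.)
ThisBuild / scalaVersion := "3.3.3"

lazy val e2eTester = (project in file("."))
  .settings(
    name := "e2e-tester",
    scalacOptions ++= Seq(
      "-deprecation", // surface APIs that changed between Scala 2 and 3
      "-feature"
    ),
    // Library dependencies have to be aligned to versions published for Scala 3.
    libraryDependencies ++= Seq(
      "org.scalatest" %% "scalatest" % "3.2.18" % Test
    )
  )
```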
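
The fail-fast behavior mentioned in the HBase client item is also worth illustrating, because it's the opposite of what many clients do by default (queue and wait). The sketch below is purely conceptual - in practice we tune the HBase client's own settings rather than wrapping it like this - but it shows the idea of rejecting work immediately once a concurrency limit is reached:

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future}

// Conceptual sketch of a fail-fast in-flight request limit: once maxInFlight
// requests are outstanding, further requests are rejected immediately instead
// of being queued. (Illustrative only - not our actual HBase client configuration.)
final class InFlightLimiter(maxInFlight: Int) {
  private val permits = new Semaphore(maxInFlight)

  def submit[A](request: => Future[A])(implicit ec: ExecutionContext): Future[A] =
    if (permits.tryAcquire()) {
      // A permit was available: run the request and release the permit when it completes.
      val result = try request catch { case e: Throwable => Future.failed(e) }
      result.onComplete(_ => permits.release())
      result
    } else {
      // No permit available: fail fast rather than queueing the request.
      Future.failed(new IllegalStateException(s"More than $maxInFlight requests in flight"))
    }
}
```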
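
Finally, to make the eventual-consistency point above more concrete: when a test asserts that a write has landed in storage, it can't expect the data to be visible immediately, so the check is retried with a backoff. Here's a minimal sketch of such a helper - the names, timings and retry counts are hypothetical, not our actual implementation:

```scala
// Hypothetical retry-with-backoff helper for asserting on eventually consistent
// storage; names, timings and retry counts are illustrative.
object EventuallyConsistent {

  def eventually[A](maxAttempts: Int = 10, delayMillis: Long = 200)(check: => A): A =
    try check
    catch {
      case _: Throwable if maxAttempts > 1 =>
        // The async write may simply not have propagated yet: wait,
        // then try again with double the delay (a simple exponential backoff).
        Thread.sleep(delayMillis)
        eventually(maxAttempts - 1, delayMillis * 2)(check)
    }
}

// Example usage in a test, where fetchMessageFromStorage is a hypothetical helper:
//   EventuallyConsistent.eventually() {
//     assert(fetchMessageFromStorage(messageId).isDefined)
//   }
```

If the data still hasn't appeared after the final attempt, the underlying assertion error propagates and the test fails, so this reduces flakiness without hiding real bugs.
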
Eventually we had a pipeline that worked consistently, and it was time to share it with the wider team. End-to-end testing at the PR stage is now undergoing a gradual rollout in our internal development process, and we hope it will catch more outages before they happen and allow us to deliver new LINE messaging features with confidence!
If you'd like to work on these kinds of things with us, we're hiring for Messaging Backend Engineers and related positions.