|
Posted by: stak
Tags: mozilla
Posted on: 2019-07-15 09:22:44
Don't panic.
I spent a good chunk of this year fiddling with taskcluster configurations in order to get various bits of continuous integration stood up for WebRender. Taskcluster configuration is very flexible and powerful, but can also be daunting at first. This guide is intended to give you a mental model of how it works, and how to add new jobs and modify existing ones. I'll try and cover things in detail where I believe the detail would be helpful, but in the interest of brevity I'll skip over things that should be mostly obvious by inspection or experimentation if you actually start digging around in the configurations. I also try and walk through examples and provide links to code as much as possible.
Based on my experience, there are two main kinds of changes most Gecko hackers might want to do:
(1) modify existing Gecko test configurations, and
(2) add new jobs that run in automation.
I'm going to explain (2) in some detail, because on top of that explaining (1) is a lot easier.
Overview of fundamentals
The taskcluster configuration lives in-tree in the taskcluster/ folder. The most interesting subfolders there are the ci/ folder, which contain job defintiions in .yml files, and the taskgraph/ folder which contain python scripts to do transforms. A quick summary of the process is that the job definitions are taken from the .yml files, run through a series of "transforms", finally producing a task definition. The task definitions may have dependencies on other task definitions; together this set is called the "task graph". That is then submitted to Taskcluster for execution. Conceptually you can think of the stuff in the .yml files as a higher-level definition, which gets "compiled" down to the final taskgraph (the "machine code") that Taskcluster actually understands and executes.
It's helpful to walk through a quick example of a transform. Consider the webrender-linux-release job definition. At the top of the kind.yml file, there are a number of transforms listed, so each of those gets applied to the job definition in turn. The first one is use_toolchains, the code for which you can find in use_toolchains.py. This transform takes the toolchains attributes of the job definition (in this example, these attributes), and figures out the corresponding toolchain jobs and artifacts (artifacts are files that are outputs of a task), The toolchain jobs are defined in taskcluster/ci/toolchains/, so we can see that the linux64-rust toolchain is here and the wrench-deps toolchain is here. The transform code discovers this, then adds dependencies for those tasks to webrender-linux-release, and populates the MOZ_TOOLCHAINS env var with the links to the artifacts. Taskcluster ensures that tasks only get run after their dependencies are done, and the MOZ_TOOLCHAINS is used by ./mach artifact toolchain to actually download and unpack those toolchains when the webrender-linux-release task runs.
Similar to this, there are lots of other transforms that live in taskcluster/taskgraph/transforms/ that do other transformations on the job definitions. Some of the job definitions in the .yml files look very different in terms of what attributes are defined, and go through many layers of transformation before they produce the final task definition. In a sense, each folder in taskcluster/ci is a domain-specific language, with the transforms eventually compiling them down to the same machine code.
One of the more important transforms is the job transform, which determines which wrapper script will be used to run your job's commands. This is usually specified in your job definition using the run.using attribute. Examples include mach or toolchain-script. These both get transformed (see transforms for mach and toolchain-script) into commands that get passed to run-task. Or you can just use run-task directly in your job, and specify the command you want to have run. In the end most jobs will boil down to run-task commands; the run-task wrapper script just does some basic abstraction over the host machine/VM and then runs the command.
Host environment and Docker
Taskcluster can manage many different underlying host systems - from Amazon Linux VMs, to dedicated physical macOS machines, and everything in between. I'm not going to into details of provisioning but there's the notion of a "worker" which is useful to know. This is the binary that runs on the host system, polls taskcluster to find new tasks scheduled to be run on that kind of host system, and executes them in whatever sandboxing is available on that host system. For example, a docker-worker instance will start a specified docker image and execute the task commands in that. A generic-worker instance will instead create a dedicated work folder and run stuff in there, cleaning up afterwards.
If you're adding a new job (e.g. some sort of code analysis or whatever) most likely you should run on Linux using docker. This means you need to specify a docker image, which you can also do using the taskcluster configuration. Again I will use the webrender-linux-release job as an example. It specifies the webrender docker image to use, which is defined with an in-tree Dockerfile. The docker image is itself built using taskcluster with the job definition here (this job definition is an example of one that looks very different from other job definitions, because the job description is literally two lines and transforms do most of the work in compiling this into a full task definition).
Circling back to generic-worker, this is the default for jobs that need to run on Windows and macOS, because Docker doesn't run either of those platforms as a target. An example is the webrender-macos-debug job, which specifies using: run-task on a t-osx-1010 worker type, which will cause it go through this transform and eventually run using the run-task wrapper under generic worker on the macOS instance. For the most part you probably won't need to care about this but it's something to be aware of if you're running jobs targetted at Windows or macOS.
Caching
As we've seen from previous sections, jobs can use docker images and toolchain artifacts that are produced by other tasks in the taskgraph. Of course, we don't want to rebuild these docker images and toolchains on every single push, as that would be quite expensive. Taskcluster provides a mechanism for caching and reusing artifacts from previous pushes. I'm not going to go into too much detail on this, but you can look at the cached_tasks transform if you're interested. I will, however, point to the %include comments in e.g. this Dockerfile and note that these include statements are special because the mark the docker image as dependent on the given file. So if the included file changes, the docker image will be rebuilt. For toolchain tasks you can specify additional dependencies on inputs using the resources attribute; this also triggers toolchain rebuilds if the dependent inputs change.
The other thing to keep in mind when adding a new job is that you want to avoid too much network traffic or redundant work. So if your job involves downloading or building stuff that usually doesn't change from one push to the next, you probably want to split up your job so that the mostly-static part is done by a toolchain or other job, and the result of that is cached and reused by your main job. This will reduce overall load and also improve the runtime of your per-push job.
Even if you don't need caching across pushes, you might want to refactor two jobs so that their shared work is extracted into a dependency, and the two dependent jobs then just do their unique postprocessing bits. This can be done by manually specifying the dependencies and pulling in artifacts from those dependencies using the fetches attribute. See here for an example. In this scenario taskcluster will again ensure the jobs run in the right order, and you can use artifacts from the dependencies, but no caching across pushes takes place.
Adding new jobs
So hopefully with the above sections you have a general idea of how taskcluster configuration works in mozilla-central CI. To add a new job, you probably want to find an existing job kind that fits what you want, and then add your job to that folder, possibly by copy/pasting an existing job with appropriate modifications. Or if you have a new type of job you want to run that's significantly different from existing ones, you can add a new kind (a new subfolder in taskcluster/ci and documented in kinds.rst). Either way, you'll want to ensure the transforms being used are appropriate and allow you to reuse the features (e.g. toolchain dependencies) that you need and that already exist.
Gecko testing
The Gecko test jobs are defined in the taskcluster/ci/test/ folder. The entry point, as always, is the kind.yml file in that folder, which lists the transforms that get applied. The tests transform is one of the largest and most complex transforms. It does a variety of things (e.g. generating fission-enabled and fission-disabled tasks for jobs), but thankfully you probably won't need to fiddle with that too much, unless you find your test suite is behaving unexpectedly. Instead, you can mostly do copy-pasting in the other .yml files to enable test suites on particular platforms or adjust options. The test-platforms.yml file allows you define "test platforms" which show up as new rows on TreeHerder and run sets of tests on a particular build. The sets of tests are defined in test-sets.yml, which in turn reference the individual test jobs defined in the various other .yml files in that folder. Enabling a test suite on a platform is generally as easy as adding the test to the test set that you care about, and maybe tweaking some of the per-platform test attributes (e.g. number of chunks, or what trees to run on) to suit your new platform. I found the .yml files to mostly self-explanatory so I won't walk through any examples here.
What I will briefly mention is how the tests are actually run. This is not strictly part of taskcluster but is good to know anyway. The test tasks generally run using mozharness, which is a set of scripts in testing/mozharness/scripts and configuration files in testing/mozharness/configs. Mozharness is responsible for setting up the firefox environment and then delegates to the actual test harness. For example, running reftests on desktop linux would run testing/mozharness/scripts/desktop_unittest.py with the testing/mozharness/configs/unittests/linux_unittest.py config (which are indicated in the job description here). The config file, among other things, tells the mozharness script where the actual test harness entrypoint is (in this case, runreftest.py) and the mozharness script will invoke that test harness after doing some setup. There's many layers here (run-task, mozharness, test harness, etc.) with the number of layers varying across platforms (mozharness scripts for Android device/emulator are totally different than desktop), and I don't have as good a grasp on all this as I would like, but hopefully this is sufficient to point you in the right direction if you need to fiddle with tests at this level.
Debugging
As always, when modifying configs you might run into unexpected problems and need to debug. There are a few tools that are useful here. One is the ./mach taskgraph command, which can run different steps of the "decision" task and generate taskgraphs. When trying to debug task generation issues my go-to technique would be to download a parameters.yml file from an existing decision task on try or m-c (you can find it in the artifacts list for the decision task on TreeHerder), and then run ./mach taskgraph target-graph -p parameters.yml. This runs the taskgraph code and emits a list of tasks that would be scheduled given the taskcluster configuration in your local tree and the parameters provided. Likewise, ./mach taskcluster-build-image and ./mach taskcluster-load-image are useful for building and testing docker images for use with jobs. You can use these to e.g. run a docker image with your local Docker installation, and see what all you might need to install on it to make it ready to run your job.
Another useful debugging tool as you start doing try pushes to test your new tasks, is the task/group inspector at tools.taskcluster.net. This is easily accessible from TreeHerder by clicking on a job and using the "Inspect task" link in the details pane (bottom left). TreeHerder provides some information, but the Taskcluster tools website provides a much richer view including the final task description that was submitted to Taskcluster, its dependencies, environment variables, and so on. While TreeHerder is useful for looking at overall push health, the Taskcluster web UI is better for debugging specific task problems, specially as you're in the process of standing up a new task.
In general the taskcluster scripts are pretty good at identifying errors and printing useful error messages. Schemas are enforced, so it's rare to run into silent failures because of typos and such.
Conclusion
There's a lot of power and flexibility afforded by Taskcluster, but with that goes a steep learning curve. Thankfully once you understand the basic shape of how Taskcluster works, most of the configuration tends to be fairly intuitive, and reading existing .yml files is a good way to understand the different features available. grep/searchfox in the taskcluster/ folder will help you out a lot. If you run into problems, there's always people willing to help - as of this writing, :tomprince is the best first point of contact for most of this, and he can redirect you if needed.
|