Blog



All timestamps are based on your local time of:

[ « Newer ][ View: List | Cloud | Calendar | Latest comments | Photo albums ][ Older » ]

The Gecko Hacker's Guide to Taskcluster2019-07-15 09:22:44

Don't panic.

I spent a good chunk of this year fiddling with taskcluster configurations in order to get various bits of continuous integration stood up for WebRender. Taskcluster configuration is very flexible and powerful, but can also be daunting at first. This guide is intended to give you a mental model of how it works, and how to add new jobs and modify existing ones. I'll try and cover things in detail where I believe the detail would be helpful, but in the interest of brevity I'll skip over things that should be mostly obvious by inspection or experimentation if you actually start digging around in the configurations. I also try and walk through examples and provide links to code as much as possible.

Based on my experience, there are two main kinds of changes most Gecko hackers might want to do:
(1) modify existing Gecko test configurations, and
(2) add new jobs that run in automation.
I'm going to explain (2) in some detail, because on top of that explaining (1) is a lot easier.

Overview of fundamentals

The taskcluster configuration lives in-tree in the taskcluster/ folder. The most interesting subfolders there are the ci/ folder, which contain job defintiions in .yml files, and the taskgraph/ folder which contain python scripts to do transforms. A quick summary of the process is that the job definitions are taken from the .yml files, run through a series of "transforms", finally producing a task definition. The task definitions may have dependencies on other task definitions; together this set is called the "task graph". That is then submitted to Taskcluster for execution. Conceptually you can think of the stuff in the .yml files as a higher-level definition, which gets "compiled" down to the final taskgraph (the "machine code") that Taskcluster actually understands and executes.

It's helpful to walk through a quick example of a transform. Consider the webrender-linux-release job definition. At the top of the kind.yml file, there are a number of transforms listed, so each of those gets applied to the job definition in turn. The first one is use_toolchains, the code for which you can find in use_toolchains.py. This transform takes the toolchains attributes of the job definition (in this example, these attributes), and figures out the corresponding toolchain jobs and artifacts (artifacts are files that are outputs of a task), The toolchain jobs are defined in taskcluster/ci/toolchains/, so we can see that the linux64-rust toolchain is here and the wrench-deps toolchain is here. The transform code discovers this, then adds dependencies for those tasks to webrender-linux-release, and populates the MOZ_TOOLCHAINS env var with the links to the artifacts. Taskcluster ensures that tasks only get run after their dependencies are done, and the MOZ_TOOLCHAINS is used by ./mach artifact toolchain to actually download and unpack those toolchains when the webrender-linux-release task runs.

Similar to this, there are lots of other transforms that live in taskcluster/taskgraph/transforms/ that do other transformations on the job definitions. Some of the job definitions in the .yml files look very different in terms of what attributes are defined, and go through many layers of transformation before they produce the final task definition. In a sense, each folder in taskcluster/ci is a domain-specific language, with the transforms eventually compiling them down to the same machine code.

One of the more important transforms is the job transform, which determines which wrapper script will be used to run your job's commands. This is usually specified in your job definition using the run.using attribute. Examples include mach or toolchain-script. These both get transformed (see transforms for mach and toolchain-script) into commands that get passed to run-task. Or you can just use run-task directly in your job, and specify the command you want to have run. In the end most jobs will boil down to run-task commands; the run-task wrapper script just does some basic abstraction over the host machine/VM and then runs the command.

Host environment and Docker

Taskcluster can manage many different underlying host systems - from Amazon Linux VMs, to dedicated physical macOS machines, and everything in between. I'm not going to into details of provisioning but there's the notion of a "worker" which is useful to know. This is the binary that runs on the host system, polls taskcluster to find new tasks scheduled to be run on that kind of host system, and executes them in whatever sandboxing is available on that host system. For example, a docker-worker instance will start a specified docker image and execute the task commands in that. A generic-worker instance will instead create a dedicated work folder and run stuff in there, cleaning up afterwards.

If you're adding a new job (e.g. some sort of code analysis or whatever) most likely you should run on Linux using docker. This means you need to specify a docker image, which you can also do using the taskcluster configuration. Again I will use the webrender-linux-release job as an example. It specifies the webrender docker image to use, which is defined with an in-tree Dockerfile. The docker image is itself built using taskcluster with the job definition here (this job definition is an example of one that looks very different from other job definitions, because the job description is literally two lines and transforms do most of the work in compiling this into a full task definition).

Circling back to generic-worker, this is the default for jobs that need to run on Windows and macOS, because Docker doesn't run either of those platforms as a target. An example is the webrender-macos-debug job, which specifies using: run-task on a t-osx-1010 worker type, which will cause it go through this transform and eventually run using the run-task wrapper under generic worker on the macOS instance. For the most part you probably won't need to care about this but it's something to be aware of if you're running jobs targetted at Windows or macOS.

Caching

As we've seen from previous sections, jobs can use docker images and toolchain artifacts that are produced by other tasks in the taskgraph. Of course, we don't want to rebuild these docker images and toolchains on every single push, as that would be quite expensive. Taskcluster provides a mechanism for caching and reusing artifacts from previous pushes. I'm not going to go into too much detail on this, but you can look at the cached_tasks transform if you're interested. I will, however, point to the %include comments in e.g. this Dockerfile and note that these include statements are special because the mark the docker image as dependent on the given file. So if the included file changes, the docker image will be rebuilt. For toolchain tasks you can specify additional dependencies on inputs using the resources attribute; this also triggers toolchain rebuilds if the dependent inputs change.

The other thing to keep in mind when adding a new job is that you want to avoid too much network traffic or redundant work. So if your job involves downloading or building stuff that usually doesn't change from one push to the next, you probably want to split up your job so that the mostly-static part is done by a toolchain or other job, and the result of that is cached and reused by your main job. This will reduce overall load and also improve the runtime of your per-push job.

Even if you don't need caching across pushes, you might want to refactor two jobs so that their shared work is extracted into a dependency, and the two dependent jobs then just do their unique postprocessing bits. This can be done by manually specifying the dependencies and pulling in artifacts from those dependencies using the fetches attribute. See here for an example. In this scenario taskcluster will again ensure the jobs run in the right order, and you can use artifacts from the dependencies, but no caching across pushes takes place.

Adding new jobs

So hopefully with the above sections you have a general idea of how taskcluster configuration works in mozilla-central CI. To add a new job, you probably want to find an existing job kind that fits what you want, and then add your job to that folder, possibly by copy/pasting an existing job with appropriate modifications. Or if you have a new type of job you want to run that's significantly different from existing ones, you can add a new kind (a new subfolder in taskcluster/ci and documented in kinds.rst). Either way, you'll want to ensure the transforms being used are appropriate and allow you to reuse the features (e.g. toolchain dependencies) that you need and that already exist.

Gecko testing

The Gecko test jobs are defined in the taskcluster/ci/test/ folder. The entry point, as always, is the kind.yml file in that folder, which lists the transforms that get applied. The tests transform is one of the largest and most complex transforms. It does a variety of things (e.g. generating fission-enabled and fission-disabled tasks for jobs), but thankfully you probably won't need to fiddle with that too much, unless you find your test suite is behaving unexpectedly. Instead, you can mostly do copy-pasting in the other .yml files to enable test suites on particular platforms or adjust options. The test-platforms.yml file allows you define "test platforms" which show up as new rows on TreeHerder and run sets of tests on a particular build. The sets of tests are defined in test-sets.yml, which in turn reference the individual test jobs defined in the various other .yml files in that folder. Enabling a test suite on a platform is generally as easy as adding the test to the test set that you care about, and maybe tweaking some of the per-platform test attributes (e.g. number of chunks, or what trees to run on) to suit your new platform. I found the .yml files to mostly self-explanatory so I won't walk through any examples here.

What I will briefly mention is how the tests are actually run. This is not strictly part of taskcluster but is good to know anyway. The test tasks generally run using mozharness, which is a set of scripts in testing/mozharness/scripts and configuration files in testing/mozharness/configs. Mozharness is responsible for setting up the firefox environment and then delegates to the actual test harness. For example, running reftests on desktop linux would run testing/mozharness/scripts/desktop_unittest.py with the testing/mozharness/configs/unittests/linux_unittest.py config (which are indicated in the job description here). The config file, among other things, tells the mozharness script where the actual test harness entrypoint is (in this case, runreftest.py) and the mozharness script will invoke that test harness after doing some setup. There's many layers here (run-task, mozharness, test harness, etc.) with the number of layers varying across platforms (mozharness scripts for Android device/emulator are totally different than desktop), and I don't have as good a grasp on all this as I would like, but hopefully this is sufficient to point you in the right direction if you need to fiddle with tests at this level.

Debugging

As always, when modifying configs you might run into unexpected problems and need to debug. There are a few tools that are useful here. One is the ./mach taskgraph command, which can run different steps of the "decision" task and generate taskgraphs. When trying to debug task generation issues my go-to technique would be to download a parameters.yml file from an existing decision task on try or m-c (you can find it in the artifacts list for the decision task on TreeHerder), and then run ./mach taskgraph target-graph -p parameters.yml. This runs the taskgraph code and emits a list of tasks that would be scheduled given the taskcluster configuration in your local tree and the parameters provided. Likewise, ./mach taskcluster-build-image and ./mach taskcluster-load-image are useful for building and testing docker images for use with jobs. You can use these to e.g. run a docker image with your local Docker installation, and see what all you might need to install on it to make it ready to run your job.

Another useful debugging tool as you start doing try pushes to test your new tasks, is the task/group inspector at tools.taskcluster.net. This is easily accessible from TreeHerder by clicking on a job and using the "Inspect task" link in the details pane (bottom left). TreeHerder provides some information, but the Taskcluster tools website provides a much richer view including the final task description that was submitted to Taskcluster, its dependencies, environment variables, and so on. While TreeHerder is useful for looking at overall push health, the Taskcluster web UI is better for debugging specific task problems, specially as you're in the process of standing up a new task.

In general the taskcluster scripts are pretty good at identifying errors and printing useful error messages. Schemas are enforced, so it's rare to run into silent failures because of typos and such.

Conclusion

There's a lot of power and flexibility afforded by Taskcluster, but with that goes a steep learning curve. Thankfully once you understand the basic shape of how Taskcluster works, most of the configuration tends to be fairly intuitive, and reading existing .yml files is a good way to understand the different features available. grep/searchfox in the taskcluster/ folder will help you out a lot. If you run into problems, there's always people willing to help - as of this writing, :tomprince is the best first point of contact for most of this, and he can redirect you if needed.

[ 0 Comments... ]

The Girl in the Plain Brown Wrapper2019-07-11 20:36:11

Book #18 of 2019 is The Girl in the Plain Brown Wrapper, last one of the Travis McGee books available at my local library. Now I need to find something else. This one was slightly better than average of the lot, I'd say.

[ 0 Comments... ]

The Green Ripper2019-07-11 20:34:35

Book #17 of 2019 is The Green Ripper, yet another of the Travis McGee books. This one was a little meh. There's a similar Jack Reacher book, although of course this one came first.

[ 0 Comments... ]

The Lonely Silver Rain2019-07-01 15:18:41

Book #16 of 2019 is The Lonely Silver Rain, again of the Travis McGee series. Also good. My local library only has a couple more of the series (which is part of the reason I've been reading them out of order) so I might as well finish those before moving on to something else.

[ 0 Comments... ]

Cinnamon Skin2019-07-01 15:15:47

Book #15 of 2019 is Cinnamon Skin, of the Travis McGee series. I liked this one more the ones I've read so far because it was all about the nitty-gritty of the chase. Surprisingly few dead ends though (seems to be a theme in his book - somebody hypothesizes something and it invariably turns out to be truth).

[ 0 Comments... ]

The Dreadful Lemon Sky2019-06-25 15:34:08

Book #14 of 2019 is The Dreadful Lemon Sky, also of the Travis McGee series. The implausible thing about this one was the number of villians.

[ 0 Comments... ]

Free Fall in Crimson2019-06-19 19:54:19

Book #13 of 2019 is Free Fall in Crimson, another of the Travis McGee books by John D. MacDonald. This guy doesn't hold back any punches, that's for sure. I was pretty surprised/shocked by the way the book turned out towards the end, but won't go into details to avoid spoilers.

[ 0 Comments... ]

The Life-Changing Magic of Tidying Up2019-06-16 19:59:06

Book #12 of 2019 is The Life-Changing Magic of Tidying Up, by Marie Kondo. This was an interesting one. Although I liked some of her ideas on how to go about tidying up, I think a lot of it is impractical for me, at least as described in the book. Perhaps in individual consultations she's able to customize her suggestions and tailor them better to the client. Still an interesting read, and it was good to be exposed to ideas from different cultures which I wouldn't have otherwise come across.

[ 0 Comments... ]

Nightmare in Pink2019-06-09 15:14:15

Book #11 of 2019 is Nightmare in Pink by John D. MacDonald. #3 of the Travis McGee series. Slightly less fun than the previous one because I wasn't totally satisfied with the way he manages to extricate himself from the situation near the end.

Also something I realized I prefer about MacDonald's writing compared to Lee Child (for instance) is that he doesn't go into a blow-by-blow account of physical action/fights. I find that I tend to mostly gloss over those parts of other books since it's not worth trying to build a mental video of the fight because it's a lot of effort and only the outcomes of the fight are relevant to the plot. MacDonald likewise tends to gloss over the details which saves me some time :)

[ 0 Comments... ]

The Deep Blue Good-by2019-06-09 15:08:42

Book #10 of 2019 is The Deep Blue Good-by by John D. MacDonald. It's the first of the novels featuring the Travis McGee character (see my previous blog post). It was pretty decent, I thought. I'm amused by the random tangential, almost poetic/stream-of-conciousness paragraphs that the author throws in once in a while. Reminds me of George Carlin. My local library doesn't have the whole series but I guess I'll read whatever they do have.

[ 0 Comments... ]

[ « Newer ][ View: List | Cloud | Calendar | Latest comments | Photo albums ][ Older » ]

 
 
(c) Kartikaya Gupta, 2004-2019. User comments owned by their respective posters. All rights reserved.
You are accessing this website via IPv4. Consider upgrading to IPv6!