I recently started experimenting with Jaeger for tracing requests in our internal GraphQL API server, along with a service we maintain that uses that API to give users easy access to job logs. Watching demos is one thing, but it’s something else to see how easy it was to incorporate OpenTracing into these Go services and see the full end-to-end trace for a request (well, as far as I could go before running into uninstrumented services). This got me thinking about how else we could leverage OpenTracing and Jaeger. What if we could trace users' jobs in our batch system, from submission, to execution, to data transfers and API calls?

The first step was to add tracing to our submission tool, which is a fairly simple Python command-line application. Then I figured I just needed to inject the trace ID into the job environment and add instrumentation to the wrapper shell script that actually runs the user’s workflow. I didn’t see any tools for this use case, so this weekend I wrote a simple Go tool that gets the parent span ID and the starting timestamp from environment variables, adds any tags the user wants, and reports a span to Jaeger. It can also wrap a subcommand for direct (and more precise) timing, and automatically determine the error state from the command’s exit code.

It was very cool seeing that first job trace. Now we just need to add tracing to our data transfer library. And the file server. And the metadata API server. And…