Building Pex with Pants in Production
August 9, 2017
GIPHY’s search analytics team has one of the coolest jobs in the world—getting data about the way the world uses GIFs and using that data to make it easier to find the best cat fail. To make this happen, we write ETLs in Scala with Spark and manage them with a Python-based task management tool called Luigi that helps us schedule our code to run across multiple workers.
Robustly deploying Python code is not one of the coolest jobs in the world. Our initial solution involved managing remote repositories and virtualenvs on our workers, and doing a git pull and rebuild per-host when we wanted to update our code. This system was especially painful given that the slowest step was compiling our Scala code, which could easily be built into portable JARs locally if we didn’t have the Python component to worry about.
Around the time it became a real pain point for us, we migrated to Pants as a unified build tool for our Scala and Python code. Pants supports building PEX targets (“Python EXecutables”), which bundle their own dependencies and can be shipped to remote hosts as easily as JARs. In this post, I’ll explain the broad strokes of how PEX works and share some of what we’ve learned from using it in production with Pants.
Pants makes it possible to build a PEX for a “hello world”-style application with a few lines of boilerplate. Imagine this is my application:
> cat hello/hello.py
print "hello this is dog"
The directory containing my code might look like this:
> ls hello/
BUILD hello.py
That BUILD file tells Pants how to build the code contained in this directory. In this case, I want it to express that I’d like to build a PEX target containing the functionality of hello.py.
> cat hello/BUILD
python_binary(
    name='hello',
    source='hello.py',
)
Once I build that target, I have an executable .pex file that I can invoke directly with python.
> pants binary hello/
...
> python ./dist/hello.pex
hello this is dog
Behind the scenes, that ./dist/hello.pex file is just a zipped directory containing my code, prefixed with a #!/usr/bin/env python shebang line so that it can be executed directly or passed to python as you see above. All of this has been standard Python functionality since version 2.6: the interpreter can run a zip archive that contains a __main__.py file. The PEX format simply exploits those features and supplies a __main__.py that handles loading the packaged dependencies and setting up the environment for portable execution. If I went ahead and unzipped ./dist/hello.pex, I’d see something like this:
> ls unzipped/
PEX-INFO __main__.py __main__.pyc hello.py hello.pyc
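The underlying mechanism can be tried without Pants or PEX at all. The following is a minimal sketch, assuming Python 3.7+ and using only the standard library: it writes a shebang line followed by a zip archive containing a __main__.py, then runs the result with the interpreter. The file name hello.zip and the helper names build_archive and run_archive are made up for illustration, and the one-line __main__.py is a stand-in for the much more capable one that PEX generates.

```python
import os
import stat
import subprocess
import sys
import tempfile
import zipfile


def build_archive(path):
    """Write a shebang line followed by a zip archive containing __main__.py."""
    with open(path, "wb") as f:
        # The shebang lets a Unix shell execute the file directly.
        f.write(b"#!/usr/bin/env python\n")
        with zipfile.ZipFile(f, "w") as zf:
            # Python runs __main__.py when it is handed a zip archive.
            # This stand-in only imports the application module.
            zf.writestr("__main__.py", "import hello\n")
            # Python 3 spelling of the article's hello.py.
            zf.writestr("hello.py", 'print("hello this is dog")\n')
    # Mark the file executable so ./hello.zip also works on Unix.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


def run_archive(path):
    """Run the archive with the current interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "hello.zip")
        build_archive(archive)
        print(run_archive(archive), end="")
```

The standard library's zipapp module packages a directory in the same way; the manual version is shown here only to make the two ingredients (shebang plus zip with __main__.py) visible.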
Brian Wickman gave an excellent talk with more examples if you’re interested in learning more about the format. In the rest of my post, I’ll explain some of the things we’ve learned about using PEX at GIPHY.
Since PEX files are just zipped directories containing my application (and its dependencies), they’re relatively easy to inspect and debug. To return to my previous example, the unzipped version of my PEX file contains my unaltered application code.
> cat unzipped/hello.py
print "hello this is dog"
This is handy for debugging the state of the deployed code in a pinch, but has also proved useful for investigating subtler issues. During our migration, we ran into an issue with a Python package whose odd structure didn’t play nicely with the PEX format, and we were able to figure out what was going on by unzipping the executable and inspecting the dependency structure.
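The same inspection can be done without unzipping anything to disk, since Python's zipfile module can read an archive that has a shebang line in front of it. Below is a minimal sketch under a few assumptions: the archive built by make_stand_in_archive is a made-up stand-in rather than a real Pants-built PEX, the entry under .deps/ is hypothetical, and PEX-INFO is treated as JSON metadata with an illustrative entry_point field.

```python
import json
import os
import tempfile
import zipfile


def make_stand_in_archive(path):
    """Create a small shebang-prefixed zip that mimics a PEX layout (made-up contents)."""
    with open(path, "wb") as f:
        f.write(b"#!/usr/bin/env python\n")
        with zipfile.ZipFile(f, "w") as zf:
            zf.writestr("PEX-INFO", json.dumps({"entry_point": "hello"}))
            zf.writestr("hello.py", 'print("hello this is dog")\n')
            # Hypothetical dependency entry, only here to have something to list.
            zf.writestr(".deps/example_dep-1.0-py2-none-any.whl/example_dep/__init__.py", "")


def list_members(archive_path, prefix=""):
    """Return the sorted member names in the archive that start with prefix."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(name for name in zf.namelist() if name.startswith(prefix))


def read_member(archive_path, member):
    """Return the text content of one file inside the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member).decode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "stand_in.pex")
        make_stand_in_archive(archive)
        print(list_members(archive))            # everything in the archive
        print(list_members(archive, ".deps/"))  # only the packaged dependencies
        print(read_member(archive, "hello.py"), end="")
        print(json.loads(read_member(archive, "PEX-INFO")))
```

Pointing list_members and read_member at a real .pex file on a worker gives a quick view of which code and which dependencies were actually deployed.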
Building for Portability
PEXes store all of their dependencies as wheels in a hidden directory. If I were to add a couple of Python dependencies to my project, I could see their corresponding wheel files in my unzipped PEX directory like this:
> ls unzipped/.deps/
platform-agnostic-dep-none-any.whl
platform-specific-dep-macosx_10_11_intel.whl
platform-specific-dep-none-linux_x86_64.whl
This approach enables the PEX to run on both my local environment and our Linux workers, but requires that wheels for every relevant platform be available at build time.
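One way to catch a missing wheel before a build fails is to check the wheel directory against the platforms you intend to target. The sketch below is illustrative only: it assumes full wheel filenames in the standard name-version-python-abi-platform.whl form (the listing above is abbreviated), the package names are hypothetical, and platform tags are compared by exact string match, which is stricter than what a real resolver does.

```python
from collections import defaultdict


def parse_wheel_filename(filename):
    """Return (distribution name, set of platform tags) for a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag after version.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename layout: %r" % filename)
    # A platform tag may list several platforms separated by dots.
    return parts[0], set(parts[-1].split("."))


def missing_platforms(wheel_filenames, required_platforms):
    """Map each distribution to the required platforms it has no wheel for."""
    available = defaultdict(set)
    for filename in wheel_filenames:
        name, platforms = parse_wheel_filename(filename)
        available[name].update(platforms)

    missing = {}
    for name, platforms in available.items():
        if "any" in platforms:
            continue  # a platform-agnostic wheel covers every target
        lacking = set(required_platforms) - platforms
        if lacking:
            missing[name] = sorted(lacking)
    return missing


if __name__ == "__main__":
    # Hypothetical wheel directory contents.
    wheels = [
        "platform_agnostic_dep-1.0-py2-none-any.whl",
        "platform_specific_dep-1.0-cp27-cp27m-macosx_10_11_intel.whl",
        "platform_specific_dep-1.0-cp27-cp27mu-linux_x86_64.whl",
        "mac_only_dep-1.0-cp27-cp27m-macosx_10_11_intel.whl",
    ]
    print(missing_platforms(wheels, ["linux_x86_64", "macosx_10_11_intel"]))
```

In this example only mac_only_dep is reported, because it has no Linux wheel; the other two dependencies are covered on both platforms.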
It wasn’t immediately obvious to us how best to manage these files: we didn’t want to bloat our git repository with large wheels, and we weren’t eager to stand up our own web-based solution. In the end, we opted to track our wheels with Git LFS and register their containing directory with Pants, which has been a pretty lightweight solution.
The payoff for introducing this format has indeed been speedier, less fussy deploys! We use Fabric to scp our Scala and Python executables to each worker, storing them under a directory named for the commit hash of the deployed version and updating a symlink to track the latest deploy.
> ls -l /mnt/deploys/
48b10732b5d022e13e9a5ec85e0c00f4ac2f93f1
78ca5ad8b6585715a4abdfbf51b83ac25ae8c177
c6299f8df863d80130e015923d22c6d9130e39dc
latest -> 78ca5ad8b6585715a4abdfbf51b83ac25ae8c177
We keep the last several deployed directories available on the remote workers, which enables us to roll back to a previously deployed version of our code with the touch of a symlink.
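The symlink flip itself is small enough to sketch with the standard library. This is not our Fabric script, just an illustration of the idea, assuming a POSIX filesystem; the function names and the short directory names in the demo are made up. It creates the new link under a temporary name and renames it over the old one, so that latest never points at nothing in between.

```python
import os
import tempfile


def point_latest_at(deploy_root, commit_hash, link_name="latest"):
    """Repoint deploy_root/link_name at the directory for commit_hash."""
    target_dir = os.path.join(deploy_root, commit_hash)
    if not os.path.isdir(target_dir):
        raise FileNotFoundError(target_dir)

    link_path = os.path.join(deploy_root, link_name)
    tmp_path = link_path + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    # Relative target, matching the "latest -> <hash>" listing.
    os.symlink(commit_hash, tmp_path)
    # Renaming over the old link replaces it in one step on POSIX.
    os.replace(tmp_path, link_path)


def current_deploy(deploy_root, link_name="latest"):
    """Return the directory name that the link currently points at."""
    return os.readlink(os.path.join(deploy_root, link_name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name in ("aaa111", "bbb222"):  # stand-ins for commit hashes
            os.mkdir(os.path.join(root, name))
        point_latest_at(root, "bbb222")  # deploy the newer version
        point_latest_at(root, "aaa111")  # roll back to the older one
        print(current_deploy(root))
```

A rollback is the same call as a deploy, pointed at an older directory that is still on disk.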
The major challenge we ran into was some poorly defined Pants build caching behavior around our platform-specific Python dependencies. We sometimes had to manually invalidate the Pants cache, deleting the relevant cache directories with rm, before we could build our Python executables.
There are many other ways we could have chosen to improve on our initial system, including containerization or even Debian packages, but building and deploying with the PEX format has been a straightforward solution that’s served us well. It’s vastly simplified our deploys while minimizing the number of new technologies in our stack and keeping our build process very consistent. In fact, we can compile our Fabric-based deploy scripts into PEXes as well!
– Fiona Condon, Search Engineer