Site Reliability Engineer
Remote (UTC +/- 2)

In a nutshell
An opportunity to apply your knowledge of Site Reliability Engineering to deploying and supporting complex, customisable technology in distributed, customer-owned cloud accounts.

Our company
The open-source technology at Snowplow empowers people to differentiate with data. Running on AWS and GCP data technologies, it is the platform for teams who want to serve complex data use cases in an increasingly privacy- and security-conscious world.
We collect, validate, enrich and load billions of events each day for our customers, who also benefit from our online experience and expertise in running our own tech.
It's an exciting time here at Snowplow. We are actively selling in 14 countries, with hundreds of customers and thousands of open-source users. Having recently closed its Series A2 fundraising with Atlantic Bridge and MMC Ventures, Snowplow is well placed to weather the ongoing economic storm.

The Opportunity
Our Private SaaS offering has grown significantly over the past year, and we now orchestrate and monitor Snowplow event pipelines across more than 150 customer-owned AWS & GCP sub-accounts. Each account has its own individualised and optimised stack, and all are capable of processing many billions of events per month. We also have a new trial experience that lets prospects self-serve a cut-down version of Snowplow, again in their own cloud.

We are looking for another SRE to help us grow to managing 1,000 and then 10,000 AWS, GCP & Azure accounts. You will be pioneering solutions for managing estates of this size through cutting-edge monitoring and automation. You'll work closely with our Head of Tech Ops on all aspects of our proprietary deployment, orchestration and monitoring stacks. Across all of our domains (full service, trial and our own) we are striving to increase service reliability, fulfil customer requests in a timely fashion, and automate recurring tasks. Task automation is essential because, unlike at most software businesses, our infrastructure estate scales linearly with our customer numbers.

Automating the maintenance and deployment of thousands of individualised stacks is an enormously ambitious undertaking and a hugely exciting infrastructure challenge.

What you'll be doing
Maintaining and developing our growing Terraform infrastructure-as-code stacks, which we use to deploy infrastructure for all internal and client use cases
Maintaining our internal infrastructure stacks, which include the HashiCorp suite as well as our Snowplow Insights UI and VPNs
Participating in our on-call rotation to help us serve our client base 24/7
Taking rotations on L3 Technical Support, where you will be responsible for triaging and dealing with infrastructure issues
Handling high-severity internal or customer incidents, ensuring we meet all SLAs

We'd love to hear from you if you
Have worked with AWS and/or GCP in a production capacity (Azure is a bonus)
Have worked with Terraform, CloudFormation or some other form of infrastructure-as-code tooling
Have any experience with the HashiCorp stack (Vault, Consul, Nomad) and understand its role in infrastructure automation (a bonus)
Have worked with Docker and are familiar with container-based architectures (Kubernetes is a bonus)
Are knowledgeable about the Linux operating system and how to manage servers in a production capacity
Are knowledgeable about cloud networking principles and how to troubleshoot issues in this space
Are comfortable scripting in one or more of: Bash, Python, Ruby or Perl
Are comfortable programming in one or more of: Java, Scala, Golang or Python
Have experience working with online marketplaces (a bonus)

What you'll get in return
A competitive package, including share options
25 days of holiday a year (plus public holidays)
Freedom to work from wherever suits you best
Cycle to work scheme
Two fantastic company Away Weeks each year, each in a different European city (or "Stay Away Weeks" when this isn't possible)
Mental health support including therapy sessions
Work alongside a supportive and talented team, with the opportunity to work on cutting-edge technology and challenging problems
Grow and develop in a fast-moving, collaborative organisation
MacBook Pro
Convenient location in central London for those who want to work there
Continuous supply of Pact coffee and healthy snacks in the office when you're here!

Summary
Company: Snowplow Analytics
Job title: Site Reliability Engineer at Snowplow Analytics (London, UK) (allows remote)
Job tags: aws, python, azure