Site Reliability Engineer
Remote job description
We are looking for a Site Reliability Engineer
You will be a key member of a tight-knit group of talented engineers who are responsible for keeping our own and our customers' Kubernetes clusters operational and healthy. You'll also have a key role in the development of the product itself, working together with our Platform Engineers to deliver the greatest Kubernetes service possible.
Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products. We are a diverse, experienced and growing team that has been fully remote since 2014 and is spread across Europe, with headquarters in Cologne.
- You maintain, operate and upgrade our own and our customers' Kubernetes clusters.
- You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.
- You understand how servers and systems work and you tweak their behavior to your needs.
- You will be responsible for our monitoring, logging and alerting.
- You will help resolve incidents on our own and our customers' clusters.
- You participate in the on-call support schedule.
- You are a go-to person in case our developers need advice regarding infrastructure.
- You will automate all the things, and the thought of Terraform doesn't make you cry.
- We (and the majority of our customers) are currently mostly distributed around Europe (around UTC), so your main time zone should be somewhere between UTC-2 and UTC+2 to make communication easier.
- You must have deep, hands-on knowledge of Kubernetes from both the end-user and the operational side.
- You're comfortable debugging systems at all levels, from kernel fundamentals right up to workloads running on Kubernetes.
- You're happy troubleshooting a wide variety of issues and you're not afraid to parse thousands of lines of logs in pursuit of an answer.
- You have good coding skills (preferably Go, but Python or similar is fine as well).
- You have experience with maintaining infrastructure as code and you know the pros and cons of various automation tools (we use Terraform & Ansible, but Chef, Puppet and the like are also a good start).
- You are fluent with Cloud Native tools running on top of Kubernetes (Prometheus, Grafana, ingress controllers, …): you know how to use them and how to configure them.
- You automate all the things by writing code. Using bash scripts makes you sad :)
Every new team member changes the team.
We love to learn from each other, and people who know things we don't are highly welcome. Even though we are almost 70 people, we aim to put the individual first when making decisions, establishing processes, etc. You'll find that from day one, your work will make a difference and will be highly valued. There are no meaningless tasks and you'll soon realize that the company is full of people who are passionate about their jobs. Our strong culture of failure helps us stay up to date and try new things. Even though we've been fully remote since 2014, we still like to meet in person twice a year at our onsites (make sure you check out our Instagram ;) ) as well as at conferences and events (as soon as they start again). Continuous learning is important to us, and we foster it through bi-yearly personal development talks, a budget for training/certifications/coaching and regular feedback talks. Becoming part of Giant Swarm means that, by extension, you also become part of the Cloud Native community. We actively contribute to upstream projects, and our quarterly hackathons will give you space to work on out-of-the-box projects. Occasionally, when we, as a team, want to fully focus on one project, we scratch all meetings and routines for a certain time so we can concentrate during our hive-sprints.
- We don't count holiday (our team members take between 25 and 35 days off on average).
- Choose your own hardware and software.
- As a company that has almost as many kids as employees, if not more, family-friendliness is crucial to us and paid parental leave is a no-brainer.
- Healthcare compensation
- Fixed monthly budget for buying cat pictures, or your mobile phone contract/co-working space if you are boring ;)
- We aim to be fully transparent (finance, salaries, communication, etc.)
We have not managed to describe exactly how we approach important company elements that are often summed up with 'buzzwords' such as agile mindset, cross-functional teams, self-organization, value of the individual or trust & teamwork. However, we truly care about them, we live them and we constantly iterate on them. Some snippets about how we do this are posted on our blog, though far from all of them.
Pro tip: Ask whenever something is unclear!
Summary
Company: Giant Swarm
Job title: Site Reliability Engineer (100% remote)
Job tags: Kubernetes, Golang, Cloud Native, K8s