As a Site Reliability Engineer (SRE) at Upbound, you'll be a vital part of the production services on which the company is building its business. You'll apply engineering principles to design and build highly reliable, scalable infrastructure and services; deployment pipelines and processes for releasing updates frequently and safely; and monitoring and alerting systems to keep it all healthy.

In this role, you will be:

Taking ownership of the health and reliability of the live production service and infrastructure, ensuring that SLOs/SLAs are consistently met
Designing, building, and automating critical portions of the Upbound Cloud service infrastructure
Troubleshooting and solving problems effectively to remediate infrastructure-related issues that affect service health
Reporting and fixing bugs in private and public projects
Providing routine maintenance and support for Kubernetes-based infrastructure, including extending the Kubernetes API and its functionality via CRD/controller applications
Making technology decisions for the business: procuring the right technology, and designing and implementing a self-service solution for the teams that consume Upbound infrastructure
Collaborating with the development teams to assess and recommend technologies that support the company's organizational needs
Balancing trade-offs between enterprise and open-source technologies to best serve Upbound
Supporting the full project lifecycle: discovery, analysis, architecture, design, documentation, building, migration, automation, and production readiness

Summary
Company: Upbound
Job title: Site Reliability Engineer at Upbound (Tulsa, OK) (allows remote)
Job tags: kubernetes, golang, cloud, api, infrastructure