About the Role

The Senior Site Reliability Engineer (SRE) will bring deep expertise in designing and supporting highly scalable, highly available infrastructure and applications in Kubernetes, and in promoting microservice design patterns in complex cloud environments. This role will serve as a subject matter expert on all aspects of our containerized deployments, including deployment, configuration, scaling, and upgrades. The ideal candidate will be passionate about mentoring other team members and departments on the adoption of new technologies and design principles, and about promoting DevOps culture and collaboration. This role will also work closely with Development, QA, and Production Support teams to ensure deployments are successful in both production and non-production environments.

What You'll Do

Troubleshoot complicated, cross-platform issues spanning OS, AWS, networking, and databases
Work closely with Development, QA and Production Support teams to make sure releases are on time and successful
Ensure the reliability and security of the infrastructure while building proactive, dynamic monitoring, alerting, and metrics solutions so that each environment meets its SLA requirements
Build infrastructure in both AWS and GCP using Terraform
Minimize or eliminate manual hand-offs and link all automated workflows
Support the Kubernetes application/infrastructure in both production and non-production environments
Establish and test disaster recovery policies and procedures
Own the resiliency and scalability of the infrastructure
Track and apply all required patches
Create and maintain technical documentation

Skills & Qualifications

BA in Computer Science or Information Systems, or a combination of education and related work experience
5 years of Site Reliability Engineering (SRE) experience
5 years of DevOps experience
2 years of Kubernetes experience
3 years of cloud platform experience (AWS and GCP)
5 years of production infrastructure experience
Strong coding experience in Ruby, Python or similar languages
Proven experience automating routine, repeatable tasks
Strong sense of ownership, ability to work independently, and a proven track record of driving products and changes
Strong experience in production support and operations
Strong experience in monitoring application/infrastructure performance and availability while creating metrics for management use
Strong experience in Terraform, Ansible, Jenkins, Linux, Docker, Helm, Elasticsearch, Prometheus
Strong automation and problem-solving skills, and the ability to follow through to completion
Ability to wear multiple hats and multitask effectively in a fast-paced environment
Capable of working both independently and as part of a group

Summary
Trumid Technologies LLC
Site Reliability Engineer (SRE) at Trumid Technologies LLC (New York, NY) (allows remote)

Tags: kubernetes, cloud, python, devops, aws