Site Reliability Engineer
Booming Games Malta Ltd.
Remote job description
We are growing and our Operations Department is looking for support to join our international team!
Responsibilities
- Interact daily with systems in different geographical locations to ensure that hardware, software, applications and networks are healthy, maintained and operating at peak performance
- Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
- Troubleshoot issues across the entire stack: hardware, software, application and network
- Drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout the organization
- Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
- Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
- Work with software engineers to improve deployment processes
- Participate in the on-call rotation for production systems
Requirements
- Sound fundamentals in operating systems, networking, and distributed systems
- Strong familiarity with Linux systems administration and management best practices
- Familiarity with container technologies: Kubernetes, CRI, Docker, namespaces, cgroups
- Strong understanding of: Ethernet, VLANs, IPv4/IPv6, ARP, DHCP, DNS, and TCP
- Familiarity with distributed system problems: leader election, Raft consensus, etc.
- Solid understanding of systems and application design, including the operational trade-offs of various designs
- Expert-level understanding of at least one public or private cloud technology such as Amazon AWS, Google GKE, or OpenStack
- Practical knowledge of various aspects of service design, including messaging protocols and behavior, caching strategies and software design practices
- Practical intermediate knowledge of shell scripting; some Ruby is a plus
- Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
- Excellent knowledge of Linux/UNIX systems administration and performance tuning
- Comfortable configuring DNS, DHCP, and LAN/WAN technologies
- Minimum 5 years of experience managing services in an internet-scale *nix environment
- Must be able to communicate well with technical as well as non-technical colleagues to achieve business goals
- Must be adaptable and able to focus on the simplest, most efficient and reliable solutions
- Track record of practical problem solving, excellent written and interpersonal communication in English, and strong documentation skills
- Curiosity and an interest in networking, systems software, and distributed systems
- Experience as a systems administrator or operations engineer
- Experience with a 24/7 production environment
- Experience with managed deployments providing software, platforms, or infrastructure as a service
- Experience with Mellanox- and Vyatta-based networking gear is a plus
- Experience with SuperMicro server and storage gear is a plus
Tags: kubernetes, containers, cni, ceph, provisioning
Location or timezone: (GMT+01:00) Berlin
Category: DevOps and SysAdmin
Posted: 1191 days ago
https://www.remote.io/remote-devops-and-sysadmin-jobs/site-reliability-engineer-11141