Staff Site Reliability Engineer
Remote job description
Founded in 2015, Apollo is a leading sales intelligence and engagement platform trusted by over 15,000 paying customers, from rapidly growing startups to the largest global enterprises. Our platform unifies a database of 200 million business contacts with advanced intelligence and engagement tools, helping over 500,000 sales, marketing, and recruiting professionals connect with the right person at the right time with the right message, at speed and scale.
In the last year, we've grown ARR 3x, quadrupled our active users, maintained profitability in 18 of the past 20 months, and recently closed a $110M Series C led by Sequoia Capital to fuel the next phase of our growth.
Working at Apollo
We are a remote-first, inclusive organization focused on operational excellence. Our way of working ensures clear expectations and an environment where you can do your best work, with ample reward.
Your Role & Mission:
As the Staff Site Reliability Engineer, you will have direct input into how we scale, secure, and monitor our systems and services throughout the entire organization. You will lead the work of our Infrastructure team, made up of experienced Systems Engineers with diverse backgrounds, and collaboratively build upon our cutting-edge infrastructure platform. Apollo Engineering strongly believes in allowing team members to take ownership of what they do, and our approach to problem-solving relies heavily upon creativity, communication, and collaboration.
Daily Adventures & Responsibilities:
As the Staff Site Reliability Engineer, you will work on:
- Infrastructure Ownership
  - Monitor the production environment for availability and take a holistic view of system health
  - Support, operate, and scale GCP to ensure high availability of all systems
  - Build software and systems to manage platform infrastructure and applications
  - Improve reliability, quality, and time-to-market of our suite of software solutions
  - Make high-level decisions on the technologies we use to create and deploy our applications
  - Implement and support monitoring systems, expanding them into a comprehensive monitoring solution with at-a-glance system statuses
- Daily & Monthly Responsibilities
  - Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  - Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
  - Provide primary operational support and engineering for multiple large distributed software applications
  - Create sustainable systems and services through automation and uplifts
  - Balance feature development speed and reliability with well-defined service level objectives
- Project Tasks
  - Task prioritization and project-level estimation
  - Use of modern project management software such as Jira to organize epics, stories, and tickets and to track progress
  - Requirements gathering for projects
  - Breaking larger tasks down into smaller tasks and identifying the order in which they should be completed
  - Using agile project management techniques like stand-ups and weekly sprint planning
- Infrastructure Configuration and Support
  - GKE cluster provisioning, maintenance, and upgrades
  - Proficiency in autoscaling techniques and deployment strategies for high availability and self-healing infrastructure
  - Database support with MongoDB, Elasticsearch, Redis, and BigQuery
  - GCP networking, firewalls, load balancers, Virtual Private Clouds, etc.
  - API Gateway and Ingresses
- Cross-Functional Collaboration
  - Communicate technical ideas to software developers in written and verbal formats
  - Communicate the team's progress on key projects and metrics to engineering management
  - Partner with development teams to improve services through rigorous testing and release procedures
  - Participate in system design consulting, platform management, and capacity planning
Experience Required to apply for this role:
- 10+ years of related experience working in a production environment
- Strong background in Linux/UNIX (Debian/Ubuntu)
- Experience with Google Cloud Platform, preferably with multiple GCP services
- Knowledge of information security best practices, particularly in the context of GCP and big data
- Experience with configuration management & automation (Ansible, Terraform)
- Experience with monitoring and metrics gathering (GCP Logs, Prometheus, New Relic)
- Familiarity with Linux Containers and Virtualization (Docker)
- Experience working with Kubernetes and Docker Compose
- Knowledgeable in networking protocols (TCP/IP, DNS, TLS, IPSEC, etc.)
- Strong interest in learning new and emerging technologies
- General experience with NoSQL and RDBMS databases (access control, administration, tuning, etc.)
- Experience validating and deploying software to cloud infrastructure, including running unit tests, producing build artifacts, and running end-to-end tests
- Ability to organize and drive an infrastructure project using Agile project management tools and techniques.
- Strong problem-solving abilities and technical reasoning.
- Ability to work cross-functionally with other engineering teams.
- Performance benchmarking and capacity planning
- Knowledge of microservices architecture
What You'll Love About Apollo
Besides the great compensation package and a culture that thrives on openness and excellence, we invest tremendous effort into developing our remote employees' careers. The team embraces that we have a sole purpose: to help customers maximize their full revenue potential on the Apollo platform. This mindset opens us up to a lot of creative approaches to making customers successful at scale. You'll be a significant part of a lean, remote team, empowered to really own your role as a proactive educator. We're very collaborative at Apollo, so you'll be able to lean on your teammates, even in adjacent departments, to help you achieve lofty goals. You'll be supported and encouraged to experiment and take educated risks that lead to big wins. And, you'll have a whole team remotely by your side to help you do it!
Summary
Company name: Apollo
Remote job title: Staff Site Reliability Engineer
Job tags: big data, Prometheus, DBMS