Our client is looking for a Lead Site Reliability Engineer to join their team. As the Lead SRE, you will be responsible for software development and infrastructure design, building services to manage, scale and monitor their shared core infrastructure. These infrastructure and service responsibilities include databases, message queues, monitoring solutions, security and networking in the cloud and in physical data centers. The ideal candidate is enthusiastic about joining a fast-paced environment and able to lead and drive improvements in the efficiency, resiliency and scalability of the client’s shared resources, which are used by many of their core product services.
Responsibilities
- Streamline and augment day-to-day operations of shared services in a 24x7x365 environment spanning AWS and physical data centers.
- Use a variety of open source technologies to foster fault-tolerant, scalable and secure services and pipelines on a global scale.
- Build tools to enhance performance, scalability and observability of resources shared between multiple projects in production.
- Collaborate with teams across the organization to define KPIs and encourage best practices in relation to performance and reliability.
- Improve observability to ensure the uptime and reliability of the organization’s infrastructure and applications.
- Troubleshoot issues across the organization’s entire ecosystem: hardware, software, application and network, within physical data center and cloud-based environments.
- Provide on-call support for infrastructure and shared services.
- Lead a small team, providing mentorship, guidance and expertise.
- Provide project management, oversight and reporting for your team.
Required Skills & Qualifications
- Proven track record of working in and leading a distributed team.
- Deep experience working in an AWS environment.
- Proven success in designing, building, optimizing, and maintaining infrastructure at large scale.
- Experience with Postgres, DynamoDB, Redis, and/or Memcached, as well as other AWS services.
- Software development experience in Go, Python and Ruby.
- A deep understanding of the Linux operating system, from console to kernel.
- Knowledge of CI/CD best practices.
- Experience with containers and container orchestration tools (Docker, Kubernetes and Spinnaker experience preferred).