SRE / Monitoring (DevOps with strong development skills)
What you’ll be doing:
Your role will focus on the development of the platform core and common platform services. You’ll solve problems related to complex cloud-infrastructure automation, multi-region networking, authentication/authorization, and logging/metrics collection at scale. You’ll also provide engineering teams with tooling and frameworks for transaction tracing, performance analysis, business monitoring, and alerting.
Observability is key to building a quality product, helping the business move fast, and resolving issues as soon as possible. The team is responsible for designing, building, and running a comprehensive observability stack for a diverse range of products and architectures. In addition, the team leads the charge on defining good observability practices company-wide and making it easier to build a better product.
• Lead/contribute to engineering efforts from design to implementation, solving complex technical challenges around monitoring distributed systems at scale.
• Drive the roadmap for the Observability platforms in conjunction with cross-functional partners. Bring together multiple perspectives and be the key connector in this important and highly visible role.
• Build, lead and mentor an Observability team; create an environment of teamwork, trust, and mutual success
• Participate in deep technical design discussions within your team and across partner teams, and ensure that we’re building the right systems.
• Drive adoption of best practices in monitoring, alerting, and performance.
• Work closely with development teams to implement monitoring & observability instrumentation within their platforms.
• Participate in a 24/7 on-call rotation for Monitoring & Observability services.
• Containerization & Container Orchestration (e.g. Docker, Kubernetes)
• Cloud Infrastructure Automation (Azure strongly preferred)
Qualifications:
• Bachelor’s degree in Computer Science or a related field, or equivalent work experience
• Previous experience delivering Observability at scale is required.
• Distributed Systems Development (e.g. asynchronous communication patterns, consensus algorithms, distributed transactions)
• Services Programming (e.g. Go, Java, Kotlin, Scala, Clojure, Python, Ruby)
• Experience working with Linux systems
• Experience with monitoring and alerting systems
• Experience designing and building reliable systems at scale
Preferred Qualifications:
• Experience with distributed tracing systems (e.g. Jaeger, OpenZipkin)
• Experience in Python, JavaScript, and/or Go
• Strong interpersonal and collaborative skills
• Infrastructure-as-code tooling (e.g. Terraform, CDK, Pulumi, CloudFormation)
• Nice to have: CDK, CloudFormation, EKS, Istio, Envoy, Helm, Kubernetes Operator development, Go, Prometheus (with Cortex), Kafka/Kinesis