Senior Site Reliability Engineer - Observability

Okta • 📍 Bengaluru, India

🌍 Remote

Job Description

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Position Overview
We are seeking a highly technical Senior Site Reliability Engineer (P3) to help build, run, and scale Okta’s enterprise-grade multi-cloud observability ecosystem. In this role, you will drive our full-stack telemetry infrastructure, ensuring that massive streams of Metrics, Logs, Traces, and Alerts are processed efficiently, securely, and cost-effectively.
As a Senior SRE, you will treat Monitoring as Code (MaC). You will utilize automation frameworks like Terraform and write robust code (Go/Python) to build self-healing pipelines, optimize backend telemetry engines (Splunk and Grafana/Mimir/Loki), and eliminate manual operations.

About the Team: Workforce Identity Cloud
Okta Workforce Identity Cloud (WIC) provides easy, secure access for your workforce so you can focus on other strategic priorities, like reducing costs and doing more for your customers.
If you like to be challenged and have a passion for solving large-scale automation, testing, and tuning problems, we would love to hear from you. The ideal candidate is someone who exemplifies the ethic of “If you have to do something more than once, automate it” and who can rapidly self-educate on new concepts and tools.

Key Responsibilities
- Full-Stack Telemetry Operations: Own and optimize the end-to-end collection, processing, and visualization pipelines for Metrics, Logs, Traces, and Alerts across highly distributed multi-cloud (AWS/GCP) environments.
- Splunk & Grafana Optimization: Act as a hands-on expert in optimizing log pipelines. Drive indexer performance, tune search efficiency (SPL), and clean up heavy dashboard queries to reduce latency and infrastructure footprint (FinOps).
- Monitoring as Code (MaC): Standardize, deploy, and maintain core observability tools, agent relays, and collectors natively using Terraform and automated CI/CD pipelines.
- Distributed Tracing & Metrics: Implement and scale OpenTelemetry (OTel) standards, Prometheus/Mimir, and tracing frameworks to map end-to-end request flows across core microservices (such as our Project Harmony initiative).
- Alert & Dashboard Governance: Implement smart, programmatic alerting guardrails to combat alert fatigue. Reduce noise by building intelligent, actionable alert pathways that route directly to auto-remediation workflows.
- Operational Drive: Help lead the execution against our operational backlog, eliminating systemic technical debt through automation and structural engineering changes.
- On-Call & Incident Co-Pilot: Participate in on-call rotations, providing tier-3 technical escalation support. Run technical post-incident reviews to convert major outages into programmatic observability checks.

Required Skills & Experience (The Essentials)
- Experience: 5+ years of dedicated experience in an SRE, DevOps, or Platform Engineering role managing highly resilient, large-scale distributed systems.
- Log Analytics Mastery (Splunk): Deep, practical experience with Splunk administration, search optimization, cluster maintenance, and writing highly efficient SPL queries at scale.
- Full-Stack Tooling: Hands-on proficiency with major observability tooling suites, specifically Grafana, Prometheus, Loki/Mimir, Cortex, or equivalent cloud-native stacks.
- Telemetry Standards: Practical experience instrumenting applications and infrastructure using OpenTelemetry (OTel), Prometheus metrics, or Jaeger/Tempo distributed tracing.
- Strong Programming Skills: Highly proficient in Go (Golang) or Python for building internal automation, custom exporters, and engineering custom SRE tooling.
- Infrastructure as Code: Solid experience writing, modularizing, and executing production-grade Terraform configurations.
- Cloud & Containers: Deep understanding of Linux internals, core networking protocols (TCP/IP, DNS, TLS), and container orchestration platforms like Amazon EKS / Kubernetes.

Bonus Skills (The "Nice-to-Haves")
- AI & Agentic SRE: Experience or interest in building AI/LLM-driven troubleshooting assistants, automated alert triaging, or smart pattern-matching tools.
- Multi-Cloud Networking: Experience bridging telemetry across hybrid cloud footprints (AWS to GCP).
- Security & Compliance: Familiarity with logging compliance standards, such as Federal STIGs or FIPS-compliant data handling.

#LI-Hybrid

P24819_3370625

The Okta Experience

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.

If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.

Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.
