Our client is one of the largest game studios, known for its highly successful online MOBA and FPS franchises.
You will join a team responsible for the observability platform used by internal engineering teams to send metrics, logs, and traces to Datadog. The team builds and operates the telemetry pipelines, self-service tools, and supporting services that make onboarding, managing, and debugging observability data easier at scale.
In this role, you will work closely with product, platform, and service teams to help them integrate with the platform, troubleshoot telemetry issues, and improve data quality, reliability, and cost efficiency. You will also contribute to the internal tooling and platform capabilities that standardize how teams onboard telemetry and use observability across the organization.
Each team member is expected to proactively propose tools, designs, and implementation strategies.
Please note that availability for afternoon/evening meetings is a requirement for this role, as most of the team is located on the US West Coast (LA and Seattle).
**Responsibilities:**
- Maintain telemetry ingestion and processing pipelines for metrics, logs, and traces.
- Build and maintain CI/CD, release, and validation automation for the observability platform.
- Manage monitoring and alerting as code, including Datadog monitors, dashboards, workflows, and Terraform-based configuration.
- Improve reliability, data quality, and cost efficiency of the telemetry platform by investigating missing data, noisy signals, pipeline regressions, and usage or billing drivers.
- Partner with product, platform, and service teams to onboard new integrations, guide telemetry migrations, and standardize implementation patterns.
- Prepare technical documentation, runbooks, migration guides, and implementation proposals, and participate in code reviews and design discussions.
- Provide operational support for teams using the observability platform, helping them onboard telemetry to Datadog and troubleshoot telemetry issues across metrics, logs, and traces.
- Plan, communicate, and execute safe rollouts for shared observability infrastructure.
**Required qualifications:**
- Minimum of 4 years of commercial work experience.
- Strong hands-on experience with at least one backend language, preferably Go.
- Practical experience with observability and monitoring concepts: metrics, logs, traces, alerting, dashboards, and telemetry troubleshooting.
- Experience with Datadog or a comparable observability platform.
- Hands-on experience with Kubernetes-based deployments and enough familiarity to understand how telemetry is configured, injected, and debugged in Kubernetes environments.
- Familiarity with Infrastructure as Code, preferably Terraform.
- Experience building or supporting CI/CD and release automation using tools such as Docker, Jenkins, GitHub Actions, and Helm.
- Solid debugging and analytical skills, especially in distributed systems, telemetry, or data pipelines.
- Good communication skills and ability to work effectively across teams and time zones.
- Bachelor's or higher degree in Computer Science, Software Engineering, or a related field.
- Experience with version control tools (e.g., Git).
- General knowledge of test automation.
**Nice to have:**
- Experience with Vector, VRL, OpenTelemetry, Prometheus/OpenMetrics, or Prometheus remote write.
- Experience with authentication and secrets systems such as Okta OIDC, Vault, or similar tooling.
- Experience leading migrations from legacy telemetry patterns to standardized platform solutions.
- Experience managing observability costs and telemetry cardinality.
- Working knowledge of OpenTelemetry/APM instrumentation and how application libraries, agents, collectors, and vendor platforms interact.
- Hands-on experience with public cloud platforms, preferably AWS.