Senior Site Reliability Engineer, Wikimedia Enterprise
Kenya, South Africa, Ghana, Morocco (Remote)
USD 116,633 – 181,243/yr
Remote, Full-time, Senior
Posted today
What you bring
Skills and Experience
Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar), and familiarity with progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Familiarity with Wikimedia or other open source projects is a plus.
Qualities that are important to us
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Customer and reliability focus: Prioritizes user experience by balancing availability, performance, and cost
Adaptability and learning: Comfortable working in a fast-evolving environment and learning new tools and technologies as needed
Additionally, we'd love it if you have
Experience managing and troubleshooting event streaming platforms at scale (e.g., Kafka, Kinesis, or similar)
Hands-on experience with cloud platforms such as AWS and/or GCP, including designing and operating production systems
Familiarity with data lake architectures and large-scale data processing frameworks (e.g., Iceberg, Flink, Spark)
Experience with continuous profiling and performance optimization tools to identify bottlenecks and improve system efficiency
Experience working with or contributing to open source projects, particularly in infrastructure or data ecosystems
Prior participation in the Wikimedia movement