Senior Site Reliability Engineer, IMF
Bloomreach
- Czechia
- Permanent employment
- Full-time
- We're taking autonomous search mainstream, making product discovery more intuitive and conversational for customers, and more profitable for businesses.
- We're making conversational shopping a reality, connecting every shopper with tailored guidance and product expertise - available on demand, at every touchpoint in their journey.
- We're designing the future of autonomous marketing, taking the work out of workflows, and reclaiming the creative, strategic, and customer-first work marketers were always meant to do.
- System Administration:
- Manage and configure our Kubernetes components to ensure they are highly available, reliable, and perform well.
- Incident Management:
- Handle incident responses and perform root cause analysis for critical issues.
- Participate in a 24/7 on-call rotation, with each shift lasting one week. We aim to have four engineers in the rotation.
- Automation and Tools Development:
- Create and maintain scripts and tools using Python and Go to automate operations and reduce manual tasks.
- Scaling and Resource Planning:
- Monitor system performance and plan for future scaling.
- Ensure there are enough resources during peak times.
- Monitoring and Logging:
- Set up and maintain systems to monitor and log activities, so issues can be detected and addressed early.
- Backup and Recovery:
- Ensure our database has reliable backups and efficient tools for quick and smooth recovery.
- Collaboration:
- Work closely with other engineers and product managers to ensure successful project delivery.
- Collaborate with L2 support engineers to ensure seamless operations and effective problem resolution.
- Worked in DevOps or Site Reliability Engineering (SRE) before.
- Understand basic DevOps principles.
- Familiar with cloud platforms, especially Google Cloud Platform (GCP).
- Know how to use Kubernetes (this is essential).
- Know how to build and maintain CI/CD pipelines in GitLab or similar.
- Good at automating tasks and scripting with Python, Go, or Shell (for basic Linux tasks and Kubernetes management).
- Experienced in handling and resolving incidents.
- Know how to use monitoring tools such as VictoriaMetrics and Grafana.
- Familiar with logging tools.
- Good at analyzing issues and finding solutions.
- Can communicate well and work well with remote teams.
- Able to work on your own and manage multiple tasks.
- Comfortable working in a fast-paced environment.
- GitLab
- VictoriaMetrics, Grafana, InfluxDB, Chronograf, Sentry
- IMF (our in-memory database written in C++), Apache Kafka, MongoDB
- Kubernetes (GKE), Google Cloud Platform, gRPC
- Python, Go
- Get to know the team, the company, and key processes.
- Start working on your first tasks.
- Learn about our infrastructure, release process, tools, and product with our help.
- Take an active role in daily operations, including monitoring and incident management.
- Work on small automation projects to make routine tasks easier and improve efficiency.
- Help develop and maintain internal tools for monitoring, logging, and automation.
- Join the on-call rotation with support from experienced team members.
- Take ownership of specific tasks and projects, working independently.
- Contribute to scaling and resource planning to ensure the system can handle future growth and peak times.
- Understand the team's direction and help shape our future.