Senior Site Reliability Engineer, IMF

Czechia
Permanent employment
Full-time

Posted 2 months ago

Bloomreach is building the world's premier agentic platform for personalization. We're revolutionizing how businesses connect with their customers, building and deploying AI agents to personalize the entire customer journey.

We're taking autonomous search mainstream, making product discovery more intuitive and conversational for customers, and more profitable for businesses.
We're making conversational shopping a reality, connecting every shopper with tailored guidance and product expertise - available on demand, at every touchpoint in their journey.
We're designing the future of autonomous marketing, taking the work out of workflows, and reclaiming the creative, strategic, and customer-first work marketers were always meant to do.

And we're building all of that on the intelligence of a single AI engine - Loomi AI - so that personalization isn't only autonomous…it's also consistent.

From retail to financial services, hospitality to gaming, businesses use Bloomreach to drive higher growth and lasting loyalty. We power personalization for more than 1,400 global brands, including American Eagle, Sonepar, and Pandora.

We are looking for a dedicated DevOps Engineer to join our Analytics team and manage our in-memory database (IMF) and related services. Our system runs on Google Cloud Platform (GCP) and Kubernetes and integrates with Kafka, MongoDB, and other services. Your job will be to keep our databases and services running smoothly, maintain reliable monitoring, and develop tools and automation for new releases, maintenance, and incident management.

The team works remotely in the Central European Time Zone. We are happy to meet you in Brno, Prague (Czechia) or Bratislava (Slovakia), where our headquarters is located.

Responsibilities

System Administration:

Manage and configure our Kubernetes components to ensure they are highly available, reliable, and perform well.

Incident Management:

Handle incident responses and perform root cause analysis for critical issues.
Participate in a 24/7 on-call rotation, with each duty lasting 1 week. We aim to have 4 engineers in the rotation.

Automation and Tools Development:

Create and maintain scripts and tools using Python and Go to automate operations and reduce manual tasks.

Scaling and Resource Planning:

Monitor system performance and plan for future scaling.
Ensure there are enough resources during peak times.

Monitoring and Logging:

Set up and maintain systems to monitor and log activities, so issues can be detected and addressed early.

Backup and Recovery:

Ensure our database has reliable backups and efficient tools for quick and smooth recovery.

Collaboration:

Work closely with other engineers and product managers to ensure successful project delivery.
Collaborate with L2 support engineers to ensure seamless operations and effective problem resolution.

Qualifications

Experience:

Previous experience in DevOps or Site Reliability Engineering (SRE).
Understanding of basic DevOps principles.
Familiarity with cloud platforms, especially Google Cloud Platform (GCP).
Working knowledge of Kubernetes is important.
Ability to build and maintain CI/CD pipelines in GitLab or a similar tool.

Skills:

Good at automating tasks and scripting with Python, Go, or Shell (for basic Linux tasks and Kubernetes management).
Experienced in handling and resolving incidents.

Tools:

Know how to use monitoring tools such as VictoriaMetrics and Grafana.
Familiar with logging tools.

Problem-solving:

Good at analyzing issues and finding solutions.

Communication:

Can communicate well and work well with remote teams.

Adaptability:

Able to work on your own and manage multiple tasks.
Comfortable working in a fast-paced environment.

Our stack

GitLab
VictoriaMetrics, Grafana, InfluxDB, Chronograf, Sentry
IMF (our in-memory database written in C++), Apache Kafka, MongoDB
Kubernetes (GKE), Google Cloud Platform, gRPC
Python, Go

Your success story

First 30 Days:

Get to know the team, the company, and key processes.
Start working on your first tasks.
Learn about our infrastructure, release process, tools, and product with our help.

First 90 Days:

Take an active role in daily operations, including monitoring and incident management.
Work on small automation projects to make routine tasks easier and improve efficiency.
Help develop and maintain internal tools for monitoring, logging, and automation.
Join the on-call rotation with support from experienced team members.

First 180 Days:

Take ownership of specific tasks and projects, working independently.
Contribute to scaling and resource planning to ensure the system can handle future growth and peak times.
Understand the team's direction and help shape our future.

More things you'll like about Bloomreach

Culture:

A great deal of freedom and trust. At Bloomreach we don't clock in and out, and we have neither corporate rules nor long approval processes. This freedom goes hand in hand with responsibility. We are interested in results from day one.
We have defined our values and the 10 underlying key behaviors that we strongly believe in. We can only succeed if everyone lives these behaviors day to day. We've embedded them in our processes like recruitment, onboarding, feedback, personal development, performance review and internal communication.
We believe in flexible working hours to accommodate your working style.
We work virtual-first with several Bloomreach Hubs available across three continents.
We organize company events to experience the global spirit of the company and get excited about what's ahead.
We encourage and support our employees to engage in volunteering activities - every Bloomreacher can take 5 paid days off to volunteer.*
We hold a stellar 4.4/5 rating. The Culture score is even higher at 4.9/5.

Personal Development:

Our People Development Program offers personal development workshops on various topics run by experts from inside the company. We are continuously developing and updating competency maps for select functions.
Our resident communication coach is available to help navigate work-related communications and decision-making challenges.*
Our managers are strongly encouraged to participate in the Leader Development Program to develop in the areas we consider essential for any leader. The program includes regular comprehensive feedback, consultations with a coach and follow-up check-ins.
Bloomreachers utilize the $1,500 professional education budget on an annual basis to purchase education products (books, courses, certifications, etc.).*

Well-being:

The Employee Assistance Program, with counselors, is available for non-work-related challenges.*
Subscription to Calm, a sleep and meditation app.*
We organize 'DisConnect' days where Bloomreachers globally enjoy one additional day off each quarter, allowing us to unwind together and focus on activities away from the screen with our loved ones.
We facilitate sports, yoga, and meditation opportunities for each other.
Extended parental leave up to 26 calendar weeks for Primary Caregivers.*

Compensation:

Restricted Stock Units or Stock Options are granted depending on a team member's role, seniority, and location.*
Everyone gets to participate in the company's success through the company performance bonus.*
We offer an employee referral bonus of up to $3,000 paid out immediately after the new hire starts.
We reward and celebrate work anniversaries - Bloomversaries!*

(*Subject to employment type. Interns are exempt from marked benefits, usually for the first 6 months.)

Excited? Join us and transform the future of commerce experiences!

If this position doesn't suit you, but you know someone who might be a great fit, share it - we will be very grateful!

Any unsolicited resumes/candidate profiles submitted through our website or to personal email accounts of employees of Bloomreach are considered property of Bloomreach and are not subject to payment of agency fees.
