Platform • SRE • Automation • AI

Application Engineer (Platform / Product SRE)

Application Engineer (Platform/Product SRE) with 13+ years of experience improving reliability, performance and operational efficiency across large-scale environments. Strong background in cloud platforms (AWS, Kubernetes, OpenShift), infrastructure as code, observability and automation. Completed postgraduate studies in AI & Machine Learning (Texas), providing a solid foundation in applied ML and emerging AI-driven automation patterns. Currently advancing skills in Agentic AI approaches and exploring how they can be safely incorporated into SRE workflows for intelligent alerting, incident assistance and self-healing capabilities.

AWS • Kubernetes • OpenShift • Terraform • Ansible • CI/CD • Observability • SLOs • DR • AI & ML • Agentic AI (learning)

Quick snapshot

  • 📍 Based in Reading, UK
  • 🏢 Application Engineer (Platform / Product SRE)
  • ☁️ AWS, OpenShift, Kubernetes, Azure DevOps
  • 📊 Observability, SLOs & reliability
  • 🧠 PG in AI & ML (Texas) • Agentic AI learner

About

I'm an Application Engineer working in the Platform/Product SRE space, with over 13 years of experience improving reliability, performance and operational efficiency across large-scale environments. My work spans cloud migrations, Kubernetes and OpenShift platforms, observability, DR automation and SRE best practices.

I enjoy building tools and frameworks that help product teams ship safely and operate confidently: from deployment inspectors and automated dashboards to DR utilities and chaos experiments. I like problems where reliability, automation and clean engineering meet.

With a postgraduate background in AI & Machine Learning from Texas, I'm now focused on how Agentic AI patterns can be applied responsibly within SRE teams, for example in intelligent alert enrichment, guided incident response and self-healing workflows, always with reliability and safety as the first concern.

Skills

Cloud & Platform

  • AWS (EC2, RDS, EKS, VPC, CloudWatch)
  • OpenShift / Kubernetes
  • Azure DevOps (pipelines, repos)
  • Terraform, Ansible, Helm

SRE & Observability

  • SLIs • SLOs • Error Budgets
  • Journey-based monitoring
  • Datadog, ELK / Kibana, Grafana, Prometheus
  • Incident response, RCA, resilience testing

Automation & CI/CD

  • Jenkins, GitHub/GitLab CI, Azure DevOps
  • Python, Shell, Groovy
  • Deployment automation & platform tooling
  • DR runbooks and automation

Foundations & AI Focus

  • Linux systems, networking fundamentals
  • ITIL, Agile ways of working
  • Architecture & design reviews, mentoring
  • PG in AI & ML (Texas) • Agentic AI (learning & exploration)

Experience

Discover Financial Services UK

Application Engineer (Platform / SRE)

Nov 2021 – Present • Reading, UK

  • Built Postgres Analyzer automation to improve cost visibility and optimisation across AWS RDS fleets.
  • Developed OCP Deployment Inspector to help product teams validate deployments against platform best practices for resiliency, probes, resources and security.
  • Developed a Disaster Recovery automation framework (traffic flip, validation and reporting), significantly reducing manual effort and improving repeatability.
  • Automated observability dashboards in Kibana for new and existing microservices, standardising logging and reducing onboarding time.
  • Implemented chaos experiments using Chaos Toolkit to validate failover and error-handling behaviour for workloads on the platform.
  • Designed a zero-downtime PostgreSQL upgrade process ensuring continuous replication and workload continuity.
  • Supported PCF → OCP migrations, modernising application deployments, environments and monitoring.
  • Defined SLIs/KPIs aligned with SLOs and contributed to journey-based monitoring for customer-critical flows.
  • Acted as on-call engineer in RRT, leading incident response and driving RCAs to reduce MTTR and prevent repeat issues.
  • Applying postgraduate AI/ML knowledge to evaluate the feasibility of, and guardrails for, AI-enabled automation within SRE.
  • Experimenting with Agentic AI patterns to identify workflows where intelligent orchestration and operational decision support can reduce manual toil and improve reliability.

Vodafone UK

Senior Site Reliability Engineer / Site Reliability Engineer

Sep 2016 – Nov 2021

  • Executed large-scale migrations from on-prem infrastructure to AWS using Terraform and automated CI/CD pipelines.
  • Deployed and managed Kubernetes clusters (EKS and on-prem) including automation via Ansible.
  • Built observability dashboards using Prometheus, Splunk, AppDynamics and Datadog to align infra and app metrics.
  • Implemented Azure DevOps pipelines for continuous delivery to AWS workloads.
  • Performed capacity analysis, availability reviews and resilience improvements across environments.
  • Handled incident response, root-cause analysis and post-incident automation to improve service availability.

Early Career — DevOps Consultant / Middleware Specialist

2012 – 2016

  • Delivered CI/CD transformations using Jenkins, Maven and Git to reduce manual deployment effort.
  • Automated Linux server provisioning and configuration using Kickstart, Chef and Ansible.
  • Installed and supported enterprise middleware (WebLogic, WebSphere, Apache) in production environments.
  • Provided proactive troubleshooting and monitoring to maintain platform stability and performance.

Selected Work & Tools

OCP Deployment Inspector

Platform tool that inspects OpenShift namespaces and deployments against best practices for resiliency, probes, resources and image policies, giving teams a clear view of deployment health.

Python • OpenShift • Kubernetes • CI/CD

Postgres Analyzer

Automation to review Postgres instances and configuration across regions, enabling cost optimisation and operational visibility for RDS fleets.

Python • AWS RDS • Reporting

DR Automation Framework

Utility to orchestrate DR activities including traffic flip, validation checks and reporting, reducing manual effort and improving confidence in failover plans.

AWS • Scripting • Runbooks • Automation

Observability Dashboard Factory

Automated creation of Kibana dashboards for services, helping teams get consistent logging views with minimal setup effort.

Kibana / ELK • Automation

Contact

I'm open to conversations around platform engineering, SRE, observability, automation and AI-assisted operations.

📧 bhargav.sutapalli@gmail.com

📍 Reading, United Kingdom