| Job Description: |
Job Summary: We are seeking an experienced AI/ML Ops Application Support Engineer to support, monitor, and maintain AI/ML platforms and applications in a production environment. The role involves ensuring the stability, performance, and reliability of machine learning pipelines, model deployments, and related cloud infrastructure, with a strong focus on operational excellence and incident management.
Key Responsibilities:
- Provide L2/L3 production support for AI/ML applications, pipelines, and model deployments
- Monitor model performance, drift, and data quality issues in production environments
- Troubleshoot and resolve incidents, alerts, and system failures across ML workflows
- Support CI/CD pipelines for model deployment and versioning
- Collaborate with Data Scientists, ML Engineers, and DevOps teams for issue resolution and enhancements
- Manage model retraining schedules, batch/real-time pipelines, and inference jobs
- Perform root cause analysis (RCA) and implement preventive measures
- Ensure adherence to SLA/SLO requirements and maintain operational dashboards/reporting
- Handle application release support, patching, and environment maintenance
- Maintain documentation for runbooks, troubleshooting guides, and standard operating procedures
Required Skills:
- Strong experience in production support / application support roles (AI/ML systems preferred)
- Hands-on experience with Python, SQL, and scripting for troubleshooting
- Knowledge of ML lifecycle (training, validation, deployment, monitoring)
- Experience with cloud platforms (Azure/AWS/GCP)
- Familiarity with ML Ops tools (e.g., MLflow, Kubeflow, SageMaker, Azure ML)
- Experience with containerization (Docker) and orchestration (Kubernetes)
- Exposure to CI/CD tools (Jenkins, GitHub Actions, Azure DevOps)
- Understanding of monitoring tools (Grafana, Prometheus, ELK, Azure Monitor)
- Strong debugging and incident management skills
Preferred Skills:
- Experience in healthcare/payer domain (claims, enrollment, analytics platforms)
- Knowledge of data engineering tools (Spark, Airflow, Databricks)
- Familiarity with model explainability and governance frameworks
- ITIL process knowledge (incident, problem, change management)
Education & Experience:
- Bachelor’s/Master’s degree in Computer Science, Data Science, or related field
- Typically 5+ years of experience in application/production support, with exposure to AI/ML systems
Role Descriptions: AI/ML OPS apps support
Essential Skills: AI/ML Ops Application support
Skills: Digital: Python; Agile Specialisation; AI Agents; AI & Gen AI - Products & Tools; Application Server Deployment & Administration
Experience Required: 8-10 |