Site Reliability Engineer - AI & ML

Job details

Posted Friday 17 January 2025
Job type Permanent
Discipline Senior SRE
Reference 432
Recruiter Name Geraldine Flanagan

Join an industry leader in Enterprise Technology Management solutions. Their SaaS solution orchestrates and automates key business processes for IT with agentless integrations, best practices, and low-code workflows. This enables enterprises to leverage their existing infrastructure systems and automate processes, thereby reducing reliance on error-prone manual tasks and tickets.

We are recruiting an experienced AI & ML Site Reliability Engineer who is passionate about AI, machine learning, and data science to support innovations in AI and Data product management.

In this role, you will

  • be responsible for architecting and maintaining the infrastructure that supports machine learning (ML), artificial intelligence (AI), and data-driven solutions.
  • help stand up the foundational systems that enable large-scale AI deployment, including developing and managing a big data analytics platform, developing AI architecture, implementing vector databases, building knowledge graphs, and optimizing systems for ML model deployment and inference.
  • collaborate closely with data scientists, infrastructure engineers, product management teams, and UX designers to ensure our customers realize meaningful business value by streamlining workflows, ensuring scalability, and managing the complete lifecycle of AI systems from development to production.

Qualifications

    • Bachelor’s degree in Computer Science, Engineering, Data Science, or a related field.
    • 5+ years of experience in site reliability engineering, DevOps, MLOps, or a similar role.
    • Experience with cloud platforms such as AWS, GCP, or Azure, including AI/ML services (e.g., SageMaker, Google Colab, Vertex AI).
    • Proficiency in deploying machine learning models (e.g., regressions, decision trees, neural networks, recommendation systems) into production and managing the model lifecycle.

Technical Skills

  • Experience with large-scale data processing tools such as Apache Spark, Hadoop, or Airflow.
  • Experience with AI/ML tools and frameworks (e.g., TensorFlow, PyTorch, LangChain, Hugging Face).
  • Strong understanding of vector databases (e.g., Pinecone, Milvus, Chroma) and knowledge graph technologies (e.g., Neo4j, RDF).
  • Experience with RAG (Retrieval-Augmented Generation) techniques and GraphRAG systems.
  • Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Proficiency in programming languages such as Python and Bash, and experience with ML tools and libraries.
  • Experience implementing CI/CD for ML pipelines and working with ML version control systems (e.g., DVC, MLflow).
  • Experience with on-call incident response in high-uptime environments.