Senior AI Scientist

Remote Full-time
Job Description:
• Define and own what “good” means for search-augmented and agentic AI systems by designing evaluation frameworks that measure real-world quality, reliability, and user-relevant behavior beyond standard benchmarks.
• Invent and validate novel evaluation methodologies for non-deterministic systems (LLMs, agents, RAG), including behavioral evals, long-tail and adversarial test sets, and task-specific metrics.
• Develop rigorous statistical frameworks for model comparison, regression detection, and uncertainty estimation, ensuring evaluation results are defensible and decision-ready.
• Build and maintain scalable evaluation systems—datasets, gold standards, eval harnesses, scoring pipelines, and analysis tooling—that can be reused across products and customers.
• Lead customer-facing evaluation research, working directly with enterprise customers to translate domain-specific quality requirements into credible, actionable evals that support product decisions and sales outcomes.
• Drive competitive evaluations and internal quality reviews, surfacing meaningful performance differences, trade-offs, and failure modes to inform product strategy and prioritization.
• Partner with engineering and product teams to integrate evals into development loops, release gating, and ongoing quality monitoring.
• Mentor and set standards for evaluation practice, reviewing eval designs, guiding other scientists, and shaping the long-term evals roadmap as systems become more agentic and complex.
• End-to-End Project Leadership: Lead the development of new AI-driven projects, encompassing ideation, prototyping, research, infrastructure design, scalability, monitoring, and evaluation.
• Rapid Iteration: Adapt quickly to user feedback and evolving requirements, ensuring continuous improvement in a fast-paced environment.

Requirements:
• Strong grounding in applied ML and statistics, with experience evaluating non-deterministic AI systems (LLMs, agents, RAG, search).
• Deep experience with AI evaluation, including metric design, gold dataset creation, head-to-head comparisons, slicing, and error analysis.
• Statistical rigor in model comparison, using methods such as paired tests, bootstrap confidence intervals, and robustness analyses.
• Proficiency in Python for evaluation and analysis, including building eval harnesses, data pipelines, scoring logic, and reproducible analysis workflows.
• Ability to translate vague product or customer goals into measurable evaluation criteria, and to challenge metrics or conclusions that don’t reflect real quality.
• Comfort engaging directly with customers and cross-functional stakeholders, explaining evaluation results, trade-offs, and limitations clearly.
• Strong written and verbal communication, including documenting methodologies and contributing to external publications or talks.

Benefits:
• Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions
• Flexible PTO with U.S. holidays observed and a one-week shutdown in December to rest and recharge*
• A competitive health insurance plan that covers 100% for the policyholder and 75% for dependents*
• 12 weeks of paid parental leave in the US*
• 401(k) program with a 3% match, vested immediately*
• $500 work-from-home stipend to be used within a year of your start date*
• $1,200 per year Health & Wellness Allowance to support your personal goals*
• The chance to collaborate with a team at the forefront of AI research

