LLM Evaluation Engineer

Remote Full-time

About the Company

ThirdLaw is building the control layer for AI in the enterprise. As companies rush to adopt LLMs and AI agents, they face new safety, compliance, and operational risks that traditional observability tools were never designed to detect. Metrics like latency or cost don’t capture when a model makes a bad decision, leaks sensitive data, or behaves unpredictably.

We help IT and Security teams answer the foundational question: "Is this OK?"—and take real-time action when it’s not.

Backed by top-tier venture firms and trusted by forward-looking enterprise design partners, we’re building the infrastructure to monitor, evaluate, and control AI behavior in real-world environments—at runtime, where it matters. If you're excited to build systems that help AI work as intended—and stop it when it doesn’t—we’d love to meet you.

About the Role

You’ll build the evaluation layer in the ThirdLaw platform: the part of the system that decides whether an LLM prompt, response, tool call, or agent behavior is acceptable. This includes designing and tuning guardrails, classifiers, and semantic judgment systems that operate in real time. You'll integrate foundation models, similarity search, rules engines, and prompt templates to power high-precision, low-latency policy enforcement.

This is not an experimental ML research role—it’s a product-critical engineering role. You’ll work with structured trace data, foundation models, and real-world constraints to build AI safety systems that actually ship. If you’ve built LLM-powered tools and care deeply about how they behave, this is your chance to help define what “trustworthy AI” means in enterprise environments.

What You’ll Do
• Design and build real-time evaluation logic that determines whether LLM prompts or outputs violate enterprise policies.
• Implement evaluation strategies using a mix of semantic similarity, foundation model scoring, rule-based systems, and statistical checks.
• Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking).
• Prototype, tune, and productize small language models and prompt templates for classification, labeling, or scoring.
• Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage layers.
• Build tools to observe, debug, and improve evaluator performance across real-world data distributions.
• Define abstractions for reusable evaluation components that can scale across use cases.

Who We're Looking For

Required
• 7+ years of experience in ML systems or AI engineering roles, with at least 1–2 years working directly with LLMs, NLP pipelines, or semantic search.
• Deep understanding of foundation models (e.g. OpenAI's GPT models, Claude, Mistral, Llama) and how to work with them via hosted APIs or open-source weights.
• Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines.
• Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules.
• Strong in Python, with familiarity using libraries like Hugging Face Transformers, LangChain, and PyTorch or TensorFlow.
• Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production.

Nice-to-Have
• Experience with OpenTelemetry, Model Context Protocol (MCP), or structured tracing of multi-agent or multi-model pipelines.
• Experience with red-teaming, AI risk taxonomies, or safety audits for LLM-based systems.
• Based in or willing to spend time in the San Francisco Bay Area for in-person collaboration.

Why Apply?

Our team is small and focused, valuing autonomy and real impact over titles and management. We need strong technical skills, a proactive mindset, and clear written communication, as much of our work is asynchronous. If you're organized, take initiative, and want to work closely with customers to shape our products, you'll fit in well here.

Finally, we pay market cash compensation and generally above-market equity. The compensation package for this role is benchmarked using Carta Total Compensation and reflects real-time market data for our company’s size, this role’s level, and your geographic location. We have well-designed and generous benefits.

https://www.thirdlaw.io/
