General Knowledge Evaluator (MultiChallenge benchmark)

Remote, Part-time

We are hiring expert evaluators to support a high-impact reasoning evaluation workflow in partnership with a leading AI research lab. The work centers on the MultiChallenge benchmark, which tests large language models (LLMs) on multi-turn conversational reasoning, a capability where even top models still fall short.

This benchmark does not focus on domain-specific expertise, but rather on reasoning consistency, instruction retention, and contextual inference across loosely structured conversations on general topics.

To apply, copy and paste the referral link below into your browser. On the recruiting platform, you will register, upload your resume, and take an AI-led interview.

https://work.mercor.com/jobs/list_AAABmHvRFlyxHvS-YFdJCbzl?referralCode=9df2a9e1-2f06-11ef-ae42-42010a400fc4&utm_source=referral&utm_medium=share&utm_campaign=job_referral

What is MultiChallenge?

MultiChallenge is a newly released benchmark targeting reasoning failures that occur in multi-turn interactions between humans and LLMs. It evaluates four categories of failure modes:
• Instruction Retention – Does the model persistently follow instructions across turns?
• Inference Memory – Can it infer or recall relevant user details from earlier conversation history?
• Reliable Versioned Editing – Can it revise content through multi-step iteration without forgetting or hallucinating?
• Self-Coherence – Does it contradict its earlier claims, particularly under user pressure?

The benchmark is designed to surface realistic, high-difficulty conversational reasoning challenges. Despite scoring highly on other multi-turn benchmarks, current frontier models achieve less than 50% accuracy on MultiChallenge.

Who We're Hiring

We are seeking evaluators with strong backgrounds in Logic, Philosophy, or related disciplines — particularly those trained to track argument structure, detect reasoning errors, and evaluate coherence across extended discourse.

This workflow is ideal for individuals with academic experience in:
• Logic
• Analytic Philosophy
• Epistemology
• Formal Semantics
• Cognitive Science
• Linguistics (with a reasoning focus)

Key Responsibilities
• Evaluate the reasoning quality of LLM outputs across 8–10 turn conversations.
• Identify errors in instruction-following, factual coherence, inference, and revision handling.
• Complete evaluations using a structured rubric and short written justifications.
• Work asynchronously using provided tools and examples.

You’re a Strong Fit If You Have:
• A PhD (or are currently a PhD candidate) in Logic, Philosophy, or a closely related field.
• Experience analyzing or writing complex arguments.
• Excellent written communication and generalist reasoning ability.
• Comfort working independently and asynchronously.
• (Optional) Familiarity with Python or LLM evaluation tools.

Role Details
• Part-time (10–20 hours/week) with flexible scheduling.
• 100% remote and asynchronous — work from anywhere.
• Contractor position via Mercor, paid hourly.
• Competitive rates: $20–$35/hour depending on expertise.
• Weekly payments processed securely through Stripe Connect.

Job Types: Contract, Temporary

Pay: $20.00 - $35.00 per hour

Expected hours: 10 – 20 per week

Work Location: Remote