General Knowledge Evaluator (MultiChallenge benchmark)

Remote Full-time
We are hiring expert evaluators to support a high-impact reasoning evaluation workflow in partnership with a leading AI research lab. This work centers on the MultiChallenge benchmark, which is designed to test large language models (LLMs) on multi-turn conversational reasoning, a capability where even top models fall short today.

This benchmark does not focus on domain-specific expertise, but rather on reasoning consistency, instruction retention, and contextual inference across loosely structured conversations on general topics.

To continue, copy and paste the referral link below into your browser. On the recruiting platform, you will register, upload your resume, and take an AI-led interview.

https://work.mercor.com/jobs/list_AAABmHvRFlyxHvS-YFdJCbzl?referralCode=9df2a9e1-2f06-11ef-ae42-42010a400fc4&utm_source=referral&utm_medium=share&utm_campaign=job_referral

What is MultiChallenge?

MultiChallenge is a newly released benchmark targeting reasoning failures that occur in multi-turn interactions between humans and LLMs. It evaluates four categories of failure modes:
• Instruction Retention: Does the model persistently follow instructions across turns?
• Inference Memory: Can it infer or recall relevant user details from earlier conversation history?
• Reliable Versioned Editing: Can it revise content through multi-step iteration without forgetting or hallucinating?
• Self-Coherence: Does it contradict its earlier claims, particularly under user pressure?

The benchmark is designed to surface realistic, high-difficulty conversational reasoning challenges. Despite scoring highly on other multi-turn benchmarks, current frontier models achieve less than 50% accuracy on MultiChallenge.

Who We're Hiring

We are seeking evaluators with strong backgrounds in Logic, Philosophy, or related disciplines, particularly those trained to track argument structure, detect reasoning errors, and evaluate coherence across extended discourse.

This workflow is ideal for individuals with academic experience in:
• Logic
• Analytic Philosophy
• Epistemology
• Formal Semantics
• Cognitive Science
• Linguistics (with a reasoning focus)

Key Responsibilities
• Evaluate the reasoning quality of LLM outputs across 8–10 turn conversations.
• Identify errors in instruction-following, factual coherence, inference, and revision handling.
• Complete evaluations using a structured rubric and short written justifications.
• Work asynchronously using provided tools and examples.

You're a Strong Fit If You Have:
• A PhD (or current PhD candidacy) in Logic, Philosophy, or a closely related field.
• Experience analyzing or writing complex arguments.
• Excellent written communication and generalist reasoning ability.
• Comfort working independently and asynchronously.
• (Optional) Familiarity with Python or LLM evaluation tools.

Role Details
• Part-time (10–20 hours/week) with flexible scheduling.
• 100% remote and asynchronous: work from anywhere.
• Contractor position via Mercor, paid hourly.
• Competitive rates: $20–$35/hour depending on expertise.
• Weekly payments processed securely through Stripe Connect.

Job Types: Contract, Temporary

Pay: $20.00 - $35.00 per hour

Expected hours: 10–20 per week

Work Location: Remote

