AI Benchmark Engineer | Native Language Specialist - Spanish - Remote

remote · mid

via Ashby

About this role

ABOUT THE OPPORTUNITY

We are building a rigorous, verifiable evaluation suite of Terminal-Bench tasks designed to test the limits of large language models on multilingual software challenges. Our goal is to measure multilingual robustness across prompt language effects, non-English data processing, and complex locale/encoding edge cases in terminal workflows.

We are seeking experienced native-speaking software engineers to design, build, and validate these benchmarks. You will create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.

Note: this is a remote, freelance opportunity.

WHAT YOU’LL DELIVER

- Task Engineering: Evaluating Coding Agents.…

Read the full description on Lilt-production's site →

What we'd score you on

reqspace match rubric

Five dimensions, recruiter-grade. Upload your resume and we'll generate a written explanation of where you fit and where the gaps are.

Skills match

For this role: Python, R, shell

Level fit

This role is mid-level. We check your trajectory against it.

Domain experience

Your work in the role's domain matters more than your years total. We weight recent and direct experience.

Recency

A skill you used last quarter weighs more than one from five years ago. We grade on recency, not lifetime.

Location fit

This role is remote-eligible; we factor in your stated location and time-zone overlap.

Skills in this role

Pulled from the job description. These are the keywords we'll weight when scoring your fit.

Python, R, shell
