AI Benchmark Engineer | Native Language Specialist - Spanish - Remote
remotemid
via Ashby
About this role
ABOUT THE OPPORTUNITY
We are building a rigorous, verifiable evaluation suite of Terminal-Bench tasks designed to test the limits of large language models on multilingual software challenges. Our goal is to measure multilingual robustness across prompt language effects, non-English data processing, and complex locale/encoding edge cases in terminal workflows.
We are seeking experienced native-speaking software engineers to design, build, and validate these benchmarks. You will create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.
Note this is a remote, freelance opportunity
WHAT YOU’LL DELIVER
- Task Engineering: Evaluating Coding Agents.…
What we'd score you on
reqspace match rubricFive dimensions, recruiter-grade. Upload your resume and we'll generate a written explanation of where you fit and where the gaps are.
1
Skills match
For this role: python, r, shell
2
Level fit
This role is mid-level. We check your trajectory against it.
3
Domain experience
Your work in the role's domain matters more than your years total. We weight recent and direct experience.
4
Recency
A skill you used last quarter weighs more than one from five years ago. We grade on recency, not lifetime.
5
Location fit
This role is remote-eligible — we factor in your stated location and time-zone overlap.
Score yourself on this role.
Free · no card · written explanation included
Skills in this role
Pulled from the job description. These are the keywords we'll weight when scoring your fit.
pythonrshell
