HUD (YC W25) is developing agentic evals for Computer Use Agents (CUAs) that browse the web. Its mission is to build detailed evaluations that ensure AI agents work effectively in the real world.
The company is funded by Y Combinator and a16z, collaborating with frontier AI labs to provide scalable agent evaluation infrastructure.
The role of a research engineer involves building task configurations and environments for evaluation datasets on HUD's CUA evaluation framework.
Responsibilities include creating environments for HUD's CUA evaluation datasets, which span safety red-teaming, general business tasks, and long-horizon agentic tasks, and, in the future, developing custom CUA datasets and evaluation pipelines.
Requirements:
Proficiency in Python, Docker, and Linux environments is required.
Experience with React for frontend development is necessary.
Production-level software development experience is preferred.
A strong technical aptitude and demonstrated problem-solving ability are essential.
Hands-on experience with LLM evaluation frameworks and methodologies is a plus.
Contributions to evaluation harnesses (such as EleutherAI's LM Evaluation Harness or Inspect) and experience in building custom evaluation pipelines or datasets are desirable.
Familiarity with agentic or multimodal AI evaluation systems is beneficial.
Candidates should have experience at early-stage technology startups and the ability to work independently in fast-paced environments.
Strong communication skills for remote collaboration across time zones are important.
Understanding of safety and alignment considerations in AI systems is advantageous.
Evidence of rapid learning and adaptability in technical environments is preferred.
Benefits:
Full-time candidates are preferred, though internships will also be considered.
The role is remote-friendly, with an office available in the San Francisco Bay Area for those who prefer in-person collaboration.
Visa sponsorship and relocation support are provided for strong full-time candidates.
Applications are reviewed on a rolling basis; the process typically involves 1-2 interviews and takes less than a week.