Research Engineer Roles - AI Training Data

Data Scorecard, Remote or Singapore, remote, full time / internship

Posted 3 Jul 2026. Expires 1 Oct 2026. Submitted by Jen Wei Qing.

  • ai-training-data
  • research-engineering
  • machine-learning
  • data-curation
  • internship

Data Scorecard is hiring research engineers to build the data infrastructure layer for AI training. Both openings focus on making data curation measurable across pre-training and post-training, then turning successful methods into product features.

Open roles

Founding Research Engineer

  • Build metrics for dataset quality, provenance, safety, and coverage that help predict model behaviour.
  • Diagnose data issues such as contamination, uneven multilingual coverage, long-tail gaps, and difficulty mismatches.
  • Design and test curation interventions including pruning, filtering, synthetic augmentation, and relabelling.
  • Turn recent research into practical training and evaluation loops that validate data hypotheses.
  • Ship useful methods as product features and share findings through technical reports or papers.

What they are looking for

  • Strong machine learning and deep-learning fundamentals.
  • Enough software engineering and PyTorch or JAX experience to run ML experiments and build production prototypes.
  • Hands-on experience training or evaluating LLMs or vision-language models.
  • Experience with data curation, pruning and selection, synthetic data, curriculum learning, dataset distillation, or large-scale language or multimodal training.
  • Comfort reading ML research, identifying promising ideas, and implementing them.
  • Ability to drive applied research independently in a fast-moving early-stage environment.

Nice to have

  • Post-training experience such as SFT, preference optimisation, RLVR, or reward modelling.
  • Multilingual or multimodal data work.
  • Multi-GPU or distributed training experience.
  • Open-source or Hugging Face contributions.
  • Public technical writing or published research.

Research Engineer Intern

  • Measure dataset quality, provenance, safety, and coverage with metrics that predict downstream model performance.
  • Diagnose where data will hurt a model, through work such as decontamination and difficulty annotation and by finding multilingual asymmetries and long-tail gaps.
  • Run curation interventions such as filtering, deduplication, and synthetic augmentation, then build eval harnesses to measure their effect.
  • Read recent research, reproduce promising methods, and push them beyond the original paper.
  • Help turn working methods into product features and share findings as a technical report or paper.

What they are looking for

  • Solid Python and working knowledge of the ML stack, including PyTorch or JAX and Hugging Face.
  • Exposure to LLMs or vision-language models through coursework, research, projects, fine-tuning, evaluation, or data pipelines.
  • Comfort reading an ML paper and turning it into a working experiment.
  • Evidence you can build and reason about experiments, such as a repo, paper, project, or competition.
  • Current students, recent grads, and strong self-taught engineers are welcome.

Nice to have

  • Post-training experience such as SFT, preference optimisation, RLVR, or reward modelling.
  • Multilingual or multimodal data work.
  • Synthetic data experience.
  • A shipped side project, tool, or demo that people have used.
  • Open-source or Hugging Face contributions.
  • Research experience, publications, or technical writing.