Project Overview
This project provides diverse subsets of large-scale training datasets for language models. We embed samples with Snowflake Arctic-embed-xs, cluster the embeddings with k-means, and select samples so that the subset maximizes coverage across the resulting semantic clusters.
Approach: We re-balance imbalanced datasets to reduce the over-representation of dominant categories while preserving coverage of underrepresented ones.
Goal: Offer researchers ready-to-use diverse subsets at multiple scales (50K-1M) across pre-training, instruction-following, and reasoning domains.
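The clustering and re-balancing described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the 2-D toy vectors stand in for real Snowflake Arctic-embed-xs embeddings, the small pure-Python k-means stands in for a library implementation, and the round-robin quota in the hypothetical `balanced_subset` helper is just one plausible re-balancing rule. All function names and parameters here are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the last centroid update.
    return [nearest(p, centroids) for p in points]


def balanced_subset(labels, budget, seed=0):
    """Pick up to `budget` indices, drawing one per cluster in round-robin
    order so that large clusters cannot crowd out small ones."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    for members in by_cluster.values():
        rng.shuffle(members)
    chosen = []
    while len(chosen) < budget and any(by_cluster.values()):
        for lab in sorted(by_cluster):
            if by_cluster[lab] and len(chosen) < budget:
                chosen.append(by_cluster[lab].pop())
    return chosen


# Toy "embeddings": 90 samples from a dominant topic, 10 from a rare one.
rng = random.Random(42)
dominant = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(90)]
rare = [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(10)]
points = dominant + rare

labels = kmeans(points, k=2)
subset = balanced_subset(labels, budget=10)
per_cluster = defaultdict(int)
for i in subset:
    per_cluster[labels[i]] += 1
print(dict(per_cluster))
```

Uniform random sampling of 10 items from this toy set would be expected to return about one sample from the rare group. The round-robin draw instead gives each cluster the same quota until a cluster runs out, which is the sense in which the subset trades proportionality for coverage.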
Citation
@misc{priyanshu2025stratifiedllm,
title={{Stratified LLM Subsets: Pre-Training, Instruction-Following, and Reasoning SFT Data at 100K-1M Scale}},
author={Priyanshu, Aman and Vijay, Supriti},
year={2025},
howpublished={\url{https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/}},
note={Available at \url{https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M}}
}