
What is DECILE?

Data-Efficient Learning with Less Data
State-of-the-art AI and deep learning are notoriously data hungry. This comes at significant cost: large resource bills (multiple expensive GPUs and cloud compute), long training times (often several days), and substantial human labeling time and expense. DECILE attempts to solve this by answering the following questions. Can we train state-of-the-art deep models with only a sample (say 5 to 10%) of massive datasets, with negligible impact on accuracy? Can we do this while reducing training time and cost by an order of magnitude, and/or significantly reducing the amount of labeled data required?

Why DECILE?

Addressing critical challenges in modern AI and deep learning

💰

Reduce Training Costs

State-of-the-art deep learning requires expensive GPUs and cloud infrastructure, costing thousands of dollars per experiment.

🏷️

Lower Labeling Expenses

Manual data annotation is time-consuming and expensive, often requiring domain experts for quality labels.

⚖️

Handle Noisy Data

Real-world datasets contain noise, outliers, and class imbalances that degrade model performance.

⚡

Accelerate Development

Training on massive datasets takes days or weeks, slowing down research iteration and deployment cycles.

Modules



CORDS

Reduce end-to-end training time from days to hours, and hours to minutes, using coresets and data selection. CORDS implements a number of state-of-the-art data subset selection and coreset algorithms, including GLISTER, GradMatchOMP, GradMatchFixed, CRAIG, SubmodularSelection, and RandomSelection.
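To show where any such selector plugs into training, here is a minimal, library-agnostic sketch in plain PyTorch. The random selector below is a hypothetical stand-in, not CORDS's API; a coreset method such as GradMatch or CRAIG would replace it at the marked point.

```python
# Minimal sketch of subset-based training in plain PyTorch.
# All names below are illustrative, not CORDS's API.
import torch
from torch import nn
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset: 10,000 points, 20 features, 2 classes.
X = torch.randn(10_000, 20)
y = (X[:, 0] > 0).long()
dataset = TensorDataset(X, y)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

fraction = 0.1                        # train on ~10% of the data per epoch
budget = int(fraction * len(dataset))

def select_subset() -> torch.Tensor:
    """Placeholder selector: random indices. A coreset method would
    instead pick indices whose gradients approximate the full-data
    gradient -- this is the point where a real selector plugs in."""
    return torch.randperm(len(dataset))[:budget]

for epoch in range(5):
    loader = DataLoader(dataset, batch_size=128,
                        sampler=SubsetRandomSampler(select_subset()))
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
```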



DISTIL

DISTIL is a library featuring many state-of-the-art active learning algorithms. Implemented in PyTorch, it provides fast and efficient implementations of these algorithms and lets users modularly insert active learning selection into their pre-existing training loops with minimal change. Most importantly, it delivers promising results in achieving high model performance with less labeled data. If you are looking to cut labeling costs, DISTIL should be your go-to for getting the most out of your data.
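As a rough illustration of what one selection round looks like, the sketch below performs entropy-based uncertainty sampling in plain PyTorch. The names and setup are illustrative assumptions, not DISTIL's API; DISTIL packages this and stronger strategies behind a modular interface.

```python
# One toy active-learning round: score the unlabeled pool by predictive
# entropy and query the most uncertain points for labeling.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

unlabeled = torch.randn(5_000, 20)   # pool of unlabeled examples
query_budget = 100                   # labels we can afford this round

with torch.no_grad():
    probs = torch.softmax(model(unlabeled), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

# Query the most uncertain points, send them to annotators, then append
# the resulting (x, label) pairs to the labeled set and retrain.
query_idx = entropy.topk(query_budget).indices
to_label = unlabeled[query_idx]
```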



SUBMODLIB

Summarize massive datasets using submodular optimization.
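As a toy illustration of submodular summarization, the sketch below greedily maximizes a facility-location objective in NumPy, picking the points that best "cover" the rest of the dataset. SUBMODLIB provides optimized implementations of this and many other submodular functions; nothing here mirrors its actual API.

```python
# Naive greedy maximization of a facility-location objective:
#   f(S) = sum_i max_{j in S} sim(i, j)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))            # dataset to summarize
norms = np.linalg.norm(X, axis=1)
sim = (X @ X.T) / np.outer(norms, norms)  # pairwise cosine similarity

budget, n = 10, len(X)
selected = []
cover = np.zeros(n)   # best similarity of each point to the summary so far
for _ in range(budget):
    # Marginal gain of each candidate j: improvement in total coverage.
    gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
    gains[selected] = -np.inf             # never re-pick a selected point
    j = int(gains.argmax())
    selected.append(j)
    cover = np.maximum(cover, sim[:, j])

print("summary indices:", selected)
```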



SPEAR

SPEAR is a Python library that reduces data-labeling effort using data programming. It implements several recent approaches such as Snorkel, ImplyLoss, and learning to reweight. In addition to data labeling, it integrates semi-supervised approaches for training and inference.
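For intuition, here is a toy data-programming sketch in plain Python: weak labeling functions vote on each example, and a simple majority vote aggregates them. SPEAR (like Snorkel) learns labeling-function accuracies rather than taking a raw majority; the names below are illustrative, not SPEAR's API.

```python
# Weak labeling functions (LFs) vote SPAM/HAM or abstain; an aggregator
# turns the vote matrix into weak labels for downstream training.
import numpy as np

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_keyword(text):   # fires on a spammy keyword
    return SPAM if "win money" in text.lower() else ABSTAIN

def lf_short(text):     # very short messages look like ham
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_shouting(text):  # all-caps messages look like spam
    return SPAM if text.isupper() else ABSTAIN

texts = ["WIN MONEY NOW CLICK", "see you at lunch", "ok", "Win money fast!!!"]
votes = np.array([[lf(t) for lf in (lf_keyword, lf_short, lf_shouting)]
                  for t in texts])

def majority(row):
    # Ties break toward the lower label here; real aggregators instead
    # model each LF's accuracy to weight its vote.
    valid = row[row != ABSTAIN]
    return ABSTAIN if len(valid) == 0 else int(np.bincount(valid).argmax())

labels = [majority(row) for row in votes]
print(labels)   # weak labels to train a classifier on
```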




Targeted subset selection


ML Efficiency for Large Models (MeLM)

Today's world needs orders-of-magnitude more efficient ML to address environmental and energy crises, optimize resource consumption, and improve sustainability. With the end of Moore's Law and Dennard scaling, we can no longer expect more and faster transistors at the same cost and power budget.

PI: Ganesh Ramakrishnan

🎯

Optimizing Large Language Models through Singular Vector-Based Fine-Tuning

Advancing parameter-efficient fine-tuning techniques by exploring singular vector-guided updates to adapt large-scale pre-trained models for specific downstream tasks; a toy sketch follows the focus areas below.

  • Parameter Efficiency in Model Fine-Tuning
  • Comparison and Evaluation of PEFT Techniques
  • Task-Specific Sparsity Patterns and Performance
  • Scalability and Adaptation in Large Language Models
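To ground the idea, here is a hypothetical toy layer that freezes the singular vectors of a pre-trained weight matrix and trains only its singular values. The project explores richer variants of this idea; the sketch mirrors no specific codebase.

```python
# Factor a frozen pre-trained weight W = U diag(s) V^T and adapt only
# the spectrum s, so the trainable parameter count drops dramatically.
import torch
from torch import nn

class SVDLinear(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        U, s, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)      # frozen singular vectors
        self.register_buffer("Vh", Vh)
        self.s = nn.Parameter(s.clone())  # trainable spectrum only

    def forward(self, x):
        # (U * s) scales U's columns, so this equals x @ U diag(s) Vh.
        return x @ (self.U * self.s) @ self.Vh

W = torch.randn(64, 64)        # stand-in for a pre-trained weight matrix
layer = SVDLinear(W)
print(sum(p.numel() for p in layer.parameters()))  # 64 params, not 4096
```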
🧠

Pathway to Algorithmic Generalization (Memory-Augmented Transformers)

Exploring memory-augmented Transformers (Memformers) as adaptive optimizers by implementing Linear First-Order Optimization Methods (LFOMs); a toy sketch follows the focus areas below.

  • Leveraging Memory Augmentation for Advanced Optimization
  • Comparative Performance Against Classical Optimization Techniques
  • Transformers as Meta-Optimizers
  • Theoretical Foundations and Convergence Analysis
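For intuition, the toy sketch below runs an LFOM-style update in which each step is a fixed linear combination of recent gradients; a Memformer would instead produce such coefficients adaptively from its memory state. All choices below are hand-picked illustrative assumptions.

```python
# A toy LFOM on an ill-conditioned quadratic: the update mixes the
# gradient history, generalizing momentum-style first-order methods.
import torch

def quadratic_loss(x):
    # Curvatures 1 and 10 make plain gradient descent awkward to tune.
    return 0.5 * (torch.tensor([1.0, 10.0]) * x * x).sum()

x = torch.tensor([3.0, 2.0], requires_grad=True)
history = []                   # gradient memory (most recent last)
coeffs = [0.08, 0.04, 0.02]    # weights on grads g_t, g_{t-1}, g_{t-2}

for step in range(50):
    loss = quadratic_loss(x)
    grad, = torch.autograd.grad(loss, x)
    history.append(grad)
    update = sum(c * g for c, g in zip(coeffs, reversed(history)))
    with torch.no_grad():
        x -= update

print(x.detach(), quadratic_loss(x).item())  # x should approach zero
```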
📐

Geodesic Sharpness in Transformers

Advancing symmetry-aware sharpness metrics to improve generalization predictions for Transformer models by leveraging Riemannian geometry; a naive baseline probe is sketched after the focus areas below.

  • Developing Symmetry-Invariant Sharpness Measures
  • Comparative Analysis of Geodesic Sharpness
  • Evaluating Transformer Symmetries in Attention Mechanisms
  • Potential for Sharpness-Aware Optimization
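As a baseline for what sharpness metrics measure, here is a naive Euclidean probe: the average loss increase under small random weight perturbations. Such probes are confounded by parameter symmetries (rescalings that change the weights but not the function the network computes), which is precisely what a geodesic, symmetry-aware measure is designed to factor out. The sketch is illustrative only.

```python
# Naive Euclidean sharpness probe: average loss rise under random
# weight perturbations of radius rho.
import copy
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

@torch.no_grad()
def sharpness(model, rho=0.05, trials=20):
    base = loss_fn(model(X), y).item()
    rises = []
    for _ in range(trials):
        probe = copy.deepcopy(model)
        for p in probe.parameters():
            p.add_(rho * torch.randn_like(p))   # Euclidean perturbation
        rises.append(loss_fn(probe(X), y).item() - base)
    return sum(rises) / trials

print(f"sharpness estimate: {sharpness(model):.4f}")
```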
⚙️

Efficiently Adapting Pre-Trained Models for Multiple Tasks

Investigating task arithmetic as an efficient technique for editing pre-trained models, focusing on adding, combining, or removing task-specific capabilities with minimal interference; a minimal sketch follows the focus areas below.

  • Developing Task Arithmetic for Efficient Model Adaptation
  • Investigating Weight Disentanglement Mechanisms
  • Examining Kernel-Based Approaches to Task Localization
  • Understanding the Role of Pre-Training in Task Disentanglement
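A minimal sketch of the mechanics, assuming nothing beyond standard PyTorch state_dicts: a task vector is the weight delta from fine-tuning, and scaled addition or negation of task vectors composes or removes capabilities. Real usage operates on large pre-trained checkpoints; the tiny models below are stand-ins.

```python
# Task arithmetic on state_dicts: tau = theta_finetuned - theta_pre.
import torch
from torch import nn

def make_model():
    return nn.Linear(8, 2)

pretrained = make_model()
finetuned_a = make_model()   # stands in for a checkpoint tuned on task A
finetuned_b = make_model()   # ... and on task B

def task_vector(ft, base):
    return {k: ft.state_dict()[k] - base.state_dict()[k]
            for k in base.state_dict()}

tau_a = task_vector(finetuned_a, pretrained)
tau_b = task_vector(finetuned_b, pretrained)

# Multi-task edit: theta = theta_pre + alpha * (tau_a + tau_b).
alpha = 0.5
edited = make_model()
edited.load_state_dict({k: pretrained.state_dict()[k]
                        + alpha * (tau_a[k] + tau_b[k])
                        for k in tau_a})
# Negating a task vector (theta_pre - alpha * tau_a) would instead
# suppress task A's behavior.
```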

News


Team




Interns

Research Publications


CORDS

FairPO: Fair Preference Optimization for Multi-Label Learning

Soumen Kumar Mondal, Akshit Varmora, Prateek Chanda, Ganesh Ramakrishnan

In the OPT 2025 Workshop at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Unified Wisdom: Harnessing Collaborative Learning to Improve Efficacy of Knowledge Distillation

Durga S, Atharva Abhijit Tambat, Ganesh Ramakrishnan, Pradeep Shenoy

In Transactions on Machine Learning Research (TMLR), 2025

Integrations: Informed Subset Selection Based Generation for Medical Imaging in Resource Constrained Setting

Bhavik Kanekar, Atharv Savarkar, Ganesh Ramakrishnan, Kshitij S. Jadhav

22nd IEEE International Symposium on Biomedical Imaging (ISBI 2025)

Bayesian Coreset Optimization for Personalized Federated Learning

Prateek Chanda, Shrey Modi, Ganesh Ramakrishnan

International Conference on Learning Representations (ICLR) 2024

Submodularity in data subset selection and active learning

Kai Wei, Rishabh Iyer, Jeff Bilmes

International Conference on Machine Learning (ICML) 2015

Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision

Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan

7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA

GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer

35th AAAI Conference on Artificial Intelligence, AAAI 2021

Fast multi-stage submodular maximization

Kai Wei, Rishabh K. Iyer, Jeff A. Bilmes

International Conference on Machine Learning (ICML 2014)

Submodular subset selection for large-scale speech training data

Kai Wei et al.

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014

Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec

International Conference on Machine Learning (ICML), July 2020

Coresets for Robust Training of Deep Neural Networks against Noisy Labels

Baharan Mirzasoleiman, Kaidi Cao, Jure Leskovec

In Advances in Neural Information Processing Systems (NeurIPS), 2020


DISTIL

Submodularity in data subset selection and active learning

Kai Wei, Rishabh Iyer, Jeff Bilmes

International Conference on Machine Learning (ICML) 2015

Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

Jordan T. Ash et al.

8th International Conference on Learning Representations (ICLR), 2020

GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer

In Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI 2021

An Interactive Multi-Label Consensus Labeling Model for Multiple Labeler Judgments

Ashish Kulkarni, Narasimha Raju Uppalapati, Pankaj Singh, Ganesh Ramakrishnan

In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Learning From Less Data: Diversified Subset Selection and Active Learning in Image Classification Tasks

Vishal Kaushal, Rishabh Iyer, Anurag Sahoo, Khoshrav Doctor, Narasimha Raju, Ganesh Ramakrishnan

In Proceedings of The 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA

A New Active Labeling Method for Deep Learning

Dan Wang, Yi Shang

International Joint Conference on Neural Networks (IJCNN), 2014

Deep Bayesian Active Learning with Image Data

Yarin Gal, Riashat Islam, Zoubin Ghahramani

34th International Conference on Machine Learning (ICML), 2017

Active Learning for Convolutional Neural Networks: A Core-Set Approach

Ozan Sener, Silvio Savarese

6th International Conference on Learning Representations (ICLR), 2018

Adversarial Active Learning for Deep Networks: a Margin Based Approach

Melanie Ducoffe, Frederic Precioso

arXiv, 2018.


SUBMODLIB

Bandit Guided Submodular Curriculum for Adaptive Subset Selection

Prateek Chanda, Prayas Agrawal, Saral Sureka, Lokesh Reddy Polu, Atharv Kshirsagar, Ganesh Ramakrishnan

In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

A Framework towards Domain Specific Video Summarization

Vishal Kaushal, Sandeep Subramanian, Suraj Kothawade, Rishabh Iyer, Ganesh Ramakrishnan

In Proceedings of The 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.

Demystifying Multi-Faceted Video Summarization: Tradeoff Between Diversity, Representation, Coverage and Importance

Vishal Kaushal, Rishabh Iyer, Anurag Sahoo, Pratik Dubal, Suraj Kothawade, Rohan Mahadev, Kunal Dargan, Ganesh Ramakrishnan

In Proceedings of The 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, Hawaii, USA.

Synthesis of Programs from Multimodal Datasets

Shantanu Thakoor, Simoni Shah, Ganesh Ramakrishnan, Amitabha Sanyal

In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA.

Beyond clustering: Sub-DAG Discovery for Categorising Documents

Ramakrishna Bairi, Mark Carman and Ganesh Ramakrishnan

In Proceedings of the 25th International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, USA

Building Compact Lexicons for Cross-Domain SMT by mining near-optimal Pattern Sets

Pankaj Singh, Ashish Kulkarni, Himanshu Ojha, Vishwajeet Kumar, Ganesh Ramakrishnan

In Proceedings of the 20th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2016.


SPEAR

Data Programming using Continuous and Quality-Guided Labeling Functions

Oishik Chatterjee, Ganesh Ramakrishnan, Sunita Sarawagi

In Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, USA.

An Interactive Multi-Label Consensus Labeling Model for Multiple Labeler Judgments

Ashish Kulkarni, Narasimha Raju Uppalapati, Pankaj Singh, Ganesh Ramakrishnan

In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA.

Synthesis of Programs from Multimodal Datasets

Shantanu Thakoor, Simoni Shah, Ganesh Ramakrishnan, Amitabha Sanyal

In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA.

Comparison between Explicit Learning and Implicit Modeling of Relational Features in Structured Output Spaces

Ajay Nagesh, Naveen Nair and Ganesh Ramakrishnan

In Proceedings of the 23rd International Conference on Inductive Logic Programming (ILP), 2013, Rio de Janeiro, Brazil.

Towards Efficient Named-Entity Rule Induction for Customizability

Ajay Nagesh, Ganesh Ramakrishnan, Laura Chiticariu, Rajasekar Krishnamurthy, Ankush Dharkar, Pushpak Bhattacharyya

In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012, Jeju, Korea.

Rule Ensemble Learning Using Hierarchical Kernels in Structured Output Spaces

Naveen Nair, Amrita Saha, Ganesh Ramakrishnan, Shonali Krishnaswamy

In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI), 2012, Toronto, Canada.

What Kinds of Relational Features are Useful for Statistical Learning?

Amrita Saha, Ashwin Srinivasan, Ganesh Ramakrishnan

In Proceedings of the 22nd International Conference on Inductive Logic Programming (ILP), 2012, Dubrovnik

Probing the Space of Optimal Markov Logic Networks for Sequence Labeling

Naveen Nair, Ajay Nagesh, Ganesh Ramakrishnan

In Proceedings of the 22nd International Conference on Inductive Logic Programming (ILP), 2012

Efficient Rule Ensemble Learning using Hierarchical Kernels

Pratik Jawanpuria, Saketha Nath and Ganesh Ramakrishnan

In Proceedings of the 28th International Conference on Machine Learning, 2011

Pruning Search Space for Weighted First Order Horn Clause Satisfiability

Naveen Nair, Chander Jayaraman, Kiran TVS and Ganesh Ramakrishnan

In Proceedings of the 20th International Conference on Inductive Logic Programming (ILP), Florence, Italy

BET: An Inductive Logic Programming Workbench

Srihari Kalgi, Chirag Gosar, Prasad Gawde, Ganesh Ramakrishnan, Chander Iyer, Kiran T V S, Kekin Gada and Ashwin Srinivasan

In Proceedings of the 20th International Conference on Inductive Logic Programming (ILP), Florence, Italy

Parameter Screening and Optimisation for ILP using Designed Experiments

Ashwin Srinivasan, Ganesh Ramakrishnan

In the Journal of Machine Learning Research, 11 (2010), 3481–3516

An Investigation into Feature Construction to Assist Word Sense Disambiguation

Lucia Specia, Ashwin Srinivasan, Ganesh Ramakrishnan, Sachindra Joshi and Maria das Gracas Volpe Nunes

In Machine Learning 76(1): 109-136 (2009)

Feature Construction using Theory-Guided Sampling and Randomised Search

Sachindra Joshi, Ganesh Ramakrishnan, and Ashwin Srinivasan

In Proceedings of the 18th International Conference on Inductive Logic Programming (ILP 2008), Prague, Czech Republic, September 10-12, 2008

SMART: Submodular Data Mixture Strategy for Instruction Tuning

H S V N S Kowndinya Renduchintala, Sumit Bhatia, Ganesh Ramakrishnan

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024 - Findings)

DictDis: Dictionary Constrained Disambiguation for Improved NMT

Ayush Maheshwari, Preethi Jyothi, Ganesh Ramakrishnan

In Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024 - Findings)

Speeding up NAS with Adaptive Subset Selection

Vishak Prasad C, Colin White, Paarth Jain, Sibasis Nayak, Ganesh Ramakrishnan

In Proceedings of The 2024 International Conference on Automated Machine Learning (AutoML 2024)

Gradient Coreset for Federated Learning

Durga Sivasubramanian, Lokesh Nagalapatti, Rishabh Iyer, Ganesh Ramakrishnan

In Proceedings of The 12th IEEE Winter Conference on Applications of Computer Vision (WACV 2024)

M2IoU: A Min-Max Distance based Loss Function for Bounding Box Regression in Medical Imaging

Kalash Shah, Anurag Kumar Shandilya, Bhavik Kanekar, Akshat Gautam, Pavni Tandon, Ganesh Ramakrishnan, Kshitij Jadhav

In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024

INSITE: labelling medical images using submodular functions and semi-supervised data programming

Akshat Gautam, Anurag Shandilya, Akshit Srivastava, Venkatapathy Subramanian, Ganesh Ramakrishnan, Kshitij Jadhav

In Proceedings of the 21st IEEE International Symposium on Biomedical Imaging (ISBI), 2024

Adaptive mixing of auxiliary losses in supervised learning

Durga Sivasubramanian, Ayush Maheshwari, Pradeep Shenoy, Prathosh AP, Ganesh Ramakrishnan

In Proceedings of The 37th AAAI Conference on Artificial Intelligence (AAAI 2023)

Discrete Continuous Optimization Framework for Simultaneous Clustering and Training in Mixture Models

Parth Vipul Sangani, Arjun Shashank Kashettiwar, Pritish Chakraborty, Bhuvan Reddy Gangula, Durga S, Ganesh Ramakrishnan, Rishabh K Iyer, Abir De

In Proceedings of the International Conference on Machine Learning, ICML 2023

INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models

HSVNS Kowndinya Renduchintala, Krishnateja Killamsetty, Sumit Bhatia, Milan Aggarwal, Ganesh Ramakrishnan, Rishabh Iyer, Balaji Krishnamurthy

In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), long paper

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

KrishnaTeja Killamsetty, Guttu Sai Abhishek, Aakriti, Ganesh Ramakrishnan, Alexandre V. Evfimievski, Lucian Popa, Rishabh K. Iyer

In Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)

SPEAR: Semi-supervised Data Programming in Python

Guttu Sai Abhishek, Harshad Ingole, Parth Laturia, Vineeth Dorna, Ayush Maheshwari, Rishabh Iyer, Ganesh Ramakrishnan

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi (Demo paper)

Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi and Ganesh Ramakrishnan

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi (Findings)