AI Research

DeepSeek's Reasoning Memotypes: Could Linguistic Patterns Be Replicating Through Synthetic Data?

Analysis Code: GitHub Gist
Introduction
In biology, DNA provides the blueprint for how organisms develop and reproduce. In the realm of synthetic reasoning data, we observe a similar phenomenon: specific linguistic patterns that function as “reasoning memotypes”, self-replicating units of thought structure that propagate through synthetic data generation. Analysis of Nvidia’s Nemotron post-training dataset^[1], which contains over 30 million synthetic examples generated using DeepSeek R1^[2] and other reasoning models, reveals systematic linguistic patterns that appear with extraordinary frequency. These patterns function like genetic code for reasoning behavior, encoding not just what models think, but how they structure and express thought itself. ...

Creating 2D Spatial Reasoning Data Without LLMs / VLMs: A Deterministic Curriculum Trial

Quick Links: GitHub Repository | Dataset Sample
Introduction
Spatial reasoning remains a significant challenge for language models, particularly in tasks requiring 2D navigation and visual-spatial understanding. Current approaches typically rely on either training large vision-language models (VLMs) on visual data or using language models to generate training examples through expensive API calls. This post explores what might be considered an unconventional (and possibly naive) approach: deterministic generation of spatial reasoning data without requiring any LLMs or VLMs in the data creation process. Rather than using models to generate training examples, this experiment algorithmically creates what we hope are realistic learning trajectories that simulate how spatial reasoning competency might develop over time. ...

Teaching AI to Read and Group Like I Bookmark the Web: A Journey into Dynamic Topic Modeling

Quick Links: Dataset on HuggingFace
The Topic Modeling Challenge
You know that feeling when you have 50 browser tabs open, and you’re desperately trying to organize them into bookmark folders? “ML Papers To Read,” “Funny Cat Videos,” “Recipes I’ll Never Make”… We all have our system. And apparently, it’s such a universal problem that every tech company is launching its own solution: Arc Browser with its “Spaces,” Chrome with its tab groups, and about 500 extensions promising to color-code your digital hoarding habits into submission. ...

Contra-Topic-bottleneck-t5: Efficient Topic Extraction Without the Computational Overhead

Quick Links: Model on HuggingFace | Interactive Demo
When it comes to topic extraction, the AI world seems fixated on massive models and expensive compute. But what if there were a simpler way? 🤔
The Genesis: Simplicity Through Linear Transformation
Picture this: There I was, looking for an open-source solution to extract topics from text at scale. The available options were either massive language models or complex fine-tuning pipelines. That’s when it hit me: what if we could leverage the semantic structure of existing embeddings with just a linear transformation? ...

LinearCosine: When AI Researchers Decided Multiplication was Too Mainstream

Hey there, optimization seekers and efficiency enthusiasts! 📊🧮 Today, we’re diving into a world where even basic arithmetic operations are up for debate. Buckle up as we explore LinearCosine, an experiment that asks: “Do we really need multiplication for AI?”
Quick Links to skip the talk: Project Website - Linear Cosine | GitHub Repo | Original Paper
The Paper That Started It All
During my fall break, while I was supposed to be relaxing, my roommate Yash Maurya forwarded me a fascinating paper by Hongyin Luo and Wei Sun titled “Addition is All You Need for Energy-efficient Language Models”. I was immediately intrigued by their approach to modifying one of the fundamental computations in AI: multiplication. This project builds upon my previous work on in-browser vanilla JS semantic search, such as YC-Dendrolinguistics, where I implemented a cosine similarity-based information retrieval system for YC startups. LinearCosine takes this a step further by exploring ways to make these fundamental calculations more energy-efficient. ...

AdaptKeyBERT: Stumbling Through Two Years of Keyword Extraction

Quick links (in case you want to skip my ramblings): PyPI Package | GitHub Repository
Alright, gather ’round, word enthusiasts and syntax sorcerers! 🧙‍♂️📚 Remember that time you tried to explain machine learning to your grandma and ended up comparing neural networks to her knitting patterns? Well, buckle up, because we’re about to dive into a similar realm of “What was I thinking?”: the saga of AdaptKeyBERT. ...

YC-Dendrolinguistics: Planting Linguistic Trees in the Startup Forest

Hey there, fellow AI adventurers and startup enthusiasts! 🌳🚀 Today, I’m excited to give you a peek into my latest passion project: YC-Dendrolinguistics. Buckle up as we embark on a journey through the linguistic forests of Y Combinator pitches!
The Seed of an Idea
Picture this: It’s 2 AM, I’m knee-deep in YC application videos, and suddenly it hits me: what if startup pitches are like trees? 🤔 Each word a branch, each phrase a limb, growing into this complex organism we call a pitch. That’s when YC-Dendrolinguistics was born, my wild attempt to map the DNA of startup communication. ...

Synaptic Sparks: Why I'm Wiring My Thoughts into a Neural Blogosphere

Hey there, fellow AI enthusiasts and curious minds! 🧠🤖 Today, I just want to document what led to this new adventure in regular blogging.
The Knowledge Synapse
Picture me back in 2019, a wide-eyed novice bouncing around the vast landscape of machine learning. I was devouring every GitHub gist, Medium post, and arXiv paper I could find, growing and learning at a dizzying pace. Fast forward to today, and it feels like I’ve stepped into an alternate universe. So much of the knowledge that shaped me is now locked behind paywalls or buried in long, arduous YouTube playlists, and it feels almost alien to the very person who spent countless hours absorbing it. ...

FRACTURED-SORRY-Bench: Unraveling AI Safety through Decomposing Malicious Intents

Hello, fellow AI enthusiasts! 🤖 Today, I want to dive into the FRACTURED-SORRY-Bench framework and dataset we just released. Check out the dataset, website, and GitHub repository!
The FRACTURED-SORRY Saga: A Tale of Adaptation and Decomposition
Picture this: you’re wandering through the lush collection of prompt-injection and LLM red-teaming papers, marveling at some of the weirder and wilder attack mechanisms that have been released recently. Suddenly, you realize that there aren’t many proof-of-concept resources for multi-shot red-teaming. That’s essentially the story behind creating FRACTURED-SORRY-Bench. ...