Research Portfolio — AI Alignment
From Fluency to Formalization.
Contemporary AI systems generate fluent outputs, but fluency isn't understanding. I build structured representations that force AI systems to reason about the effects of their actions, the quality of their outputs, and the context of user intent.
My work bridges the gap between high-level human intent and low-level model execution—moving from surface-level generation to precise, semantically grounded alignment.
Building Safer Agents: Beyond Click-Through Rates
LLM agents can navigate UIs, but they lack "common sense" about consequences. An agent might delete a file as easily as it opens one.
As we move from chatbots to agents that operate our devices, the gap between capability and consequence becomes dangerous. Existing benchmarks focus on whether an agent can click a button, not whether it should. In this work, I led the development of a comprehensive taxonomy and benchmark for mobile UI impacts—mapping the "blast radius" of digital actions through expert workshops, data synthesis, and systematic LLM evaluation.
Stress-Testing with "Unsafe" Traces. Standard datasets often capture "happy paths"—users successfully completing tasks. To truly evaluate safety, we needed the opposite. We developed a novel data synthesis protocol where we explicitly tasked crowdworkers to operate a mobile simulator with the goal of finding and executing unsafe, high-impact actions.
This "adversarial" collection strategy generated 250 high-stakes traces, ranging from reversible actions (e.g., toggling Wi-Fi) to irreversible consequences (e.g., deleting financial history). This dataset forces models to distinguish between benign UI interactions and those that carry enduring, real-world risks.
Our evaluation revealed a critical gap: while SOTA models can execute actions, they frequently misjudge enduring consequences—over- or under-estimating risk based on surface cues rather than semantic understanding of reversibility, statefulness, and scope.
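To make those dimensions concrete, here is a minimal Python sketch of how a single annotated action in a benchmark like this might be represented. The class names, label values, and the toy risk rule are illustrative assumptions, not the actual benchmark schema.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"        # e.g., toggling Wi-Fi
    IRREVERSIBLE = "irreversible"    # e.g., deleting financial history


class Scope(Enum):
    SELF = "self"      # affects only the user's own device or account
    OTHERS = "others"  # affects other people (messages, payments, shared data)


@dataclass
class UIAction:
    """One step in a collected trace, labeled along the impact dimensions."""
    description: str
    reversibility: Reversibility
    stateful: bool   # does the action change persistent state?
    scope: Scope


def is_high_impact(action: UIAction) -> bool:
    """Toy rule: stateful actions that are irreversible or affect others are high impact."""
    return action.stateful and (
        action.reversibility is Reversibility.IRREVERSIBLE
        or action.scope is Scope.OTHERS
    )


# The two examples mentioned above.
toggle_wifi = UIAction("Toggle Wi-Fi off", Reversibility.REVERSIBLE, stateful=True, scope=Scope.SELF)
delete_history = UIAction("Delete transaction history", Reversibility.IRREVERSIBLE, stateful=True, scope=Scope.SELF)

print(is_high_impact(toggle_wifi))     # False
print(is_high_impact(delete_history))  # True
```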
Can AI Critique Design Like a Pro?
Generative AI can produce slides, but it has no concept of "good" design. Critique is often viewed as purely subjective and hard to automate.
To build an automated design engineer, we need to formalize intuition. SlideAudit is a dataset and taxonomy I developed for automated slide evaluation—grounding critique in professional design principles rather than generic feedback like "make it pop." We wanted actionable, geometric, and typographic critique grounded in Gestalt theory.
We constructed a dataset of 2,400 slides from multiple sources, including a subset with synthetically injected flaws. Trained crowdworkers annotated each slide using our taxonomy through a custom interface that supports both categorical labeling and spatial bounding box annotation.
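As an illustration of what one annotation record could look like, the sketch below pairs a categorical taxonomy label with a spatial bounding box. The flaw names, coordinate convention, and field names are assumptions for the example, not the actual SlideAudit schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """Region of the slide a flaw applies to, in pixel coordinates (assumed convention)."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class FlawAnnotation:
    """One crowdworker judgment: a taxonomy category plus where it occurs."""
    category: str        # e.g., "misalignment", "low text contrast" (illustrative labels)
    bbox: BoundingBox
    note: str = ""       # optional free-text rationale


@dataclass
class SlideRecord:
    slide_id: str
    source: str          # e.g., "real deck" vs. "synthetic flaw injection"
    annotations: List[FlawAnnotation] = field(default_factory=list)


record = SlideRecord(
    slide_id="slide_0142",
    source="synthetic",
    annotations=[
        FlawAnnotation("misalignment", BoundingBox(120, 300, 480, 60)),
        FlawAnnotation("low text contrast", BoundingBox(80, 540, 640, 90), note="grey text on white"),
    ],
)
print(len(record.annotations))  # 2
```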
Standard LLMs are mediocre design critics out of the box. But when prompted with our structured taxonomy, their ability to identify flaws and propose valid fixes improved drastically—proving that structure is the prerequisite for style in AI systems.
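The gain from structure can be illustrated with a prompt sketch: rather than an open-ended "critique this slide," the model is asked to check each taxonomy category in turn and return a structured verdict. The category list and JSON shape below are placeholders, not the prompt used in the paper.

```python
import json

# Illustrative taxonomy categories; the real taxonomy is larger and more precise.
TAXONOMY = [
    "alignment and grid violations",
    "insufficient text contrast",
    "overcrowded layout",
    "inconsistent typography",
]


def build_critique_prompt(taxonomy: list) -> str:
    """Ask the model to audit a slide category by category, with structured output."""
    checklist = "\n".join(f"- {c}" for c in taxonomy)
    return (
        "You are auditing a presentation slide.\n"
        "For EACH category below, report whether the flaw is present, the affected region, "
        "and a concrete fix.\n"
        f"Categories:\n{checklist}\n"
        'Respond as JSON: [{"category": ..., "present": ..., "region": ..., "fix": ...}]'
    )


print(build_critique_prompt(TAXONOMY))

# The structured response can then be parsed and scored against human annotations:
example_response = '[{"category": "overcrowded layout", "present": true, "region": "body", "fix": "split into two slides"}]'
print(json.loads(example_response)[0]["fix"])
```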
Solving the "Goldfish Memory" Problem in AI
Standard RAG (Retrieval-Augmented Generation) fails at personal questions because it treats memories as isolated documents, not a connected life.
If you ask an AI "What social events did I attend during the conference?", simple retrieval fails. The AI might find a photo of a badge, or a calendar invite, but it doesn't understand that those two things constitute an "event." OmniQuery is a system we built to solve this: it enables AI to answer complex personal questions by structuring fragmented screenshots, photos, and documents into coherent semantic memories before query time.
Instead of retrieving raw data, OmniQuery first processes the user's timeline using a sliding window approach to identify "Composite Contexts"—inferring that a flight ticket + a hotel booking + a conference badge equals a "Trip."
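A minimal sketch of the sliding-window grouping idea: walk the timeline in time order and merge items that fall within a window into a candidate composite context. The window size, item fields, and merging rule are assumptions for illustration; in the actual system, a downstream LLM step infers what the grouped items jointly represent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class MemoryItem:
    """One raw captured artifact: a screenshot, photo, email, calendar entry, etc."""
    timestamp: datetime
    kind: str   # e.g., "flight ticket", "hotel booking", "conference badge"
    text: str   # extracted text or caption


def group_into_contexts(items: List[MemoryItem], window: timedelta) -> List[List[MemoryItem]]:
    """Slide over the timeline; items within `window` of the previous item join the same group."""
    groups: List[List[MemoryItem]] = []
    for item in sorted(items, key=lambda i: i.timestamp):
        if groups and item.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


timeline = [
    MemoryItem(datetime(2024, 5, 1, 9), "flight ticket", "SFO -> YYZ"),
    MemoryItem(datetime(2024, 5, 1, 14), "hotel booking", "3 nights downtown"),
    MemoryItem(datetime(2024, 5, 2, 10), "conference badge", "conference attendee"),
]

# With a 2-day window the three fragments land in one group, which a later
# LLM step could label as a single composite context such as "conference trip".
for group in group_into_contexts(timeline, timedelta(days=2)):
    print([item.kind for item in group])
```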
By structuring episodic data into semantic representations before query time, OmniQuery enables the AI to answer complex, high-dimensional questions. While baseline systems see pixels and isolated metadata, our system constructs a knowledge graph of life events.
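To show what structuring before query time buys at retrieval, here is a toy sketch: once composite contexts exist as nodes linked to their source items, a question about "events during the trip" becomes a graph lookup rather than keyword matching over raw files. The node names, edge structure, and matching rule are illustrative only, not OmniQuery's actual representation.

```python
from collections import defaultdict

# Toy event graph: each composite context links to its constituent memories.
edges = defaultdict(list)


def link(context: str, memory: str) -> None:
    edges[context].append(memory)


# Built offline, before any question is asked (illustrative contents).
link("conference trip", "flight ticket: SFO -> YYZ")
link("conference trip", "conference badge photo")
link("conference trip", "calendar invite: opening reception")
link("conference trip", "photo: dinner with lab mates")


def memories_matching(context: str, keyword: str) -> list:
    """Answer 'what <keyword> happened during <context>?' by scanning linked memories."""
    return [m for m in edges[context] if keyword in m]


# "What social events did I attend during the conference?"
print(memories_matching("conference trip", "reception") + memories_matching("conference trip", "dinner"))
```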