Research Portfolio — AI Alignment

From Fluency to Formalization.

Contemporary AI systems generate fluent outputs, but fluency isn't understanding. I build structured representations that force AI systems to reason about the effects of their actions, the quality of their outputs, and the context of user intent.

My work bridges the gap between high-level human intent and low-level model execution—moving from surface-level generation to precise, semantically grounded alignment.

01
AI Safety · Agentic Workflows · Mobile UI · IUI 2025

Building Safer Agents: Beyond Click-Through Rates

The Problem

LLM agents can navigate UIs, but they lack "common sense" about consequences. An agent might delete a file as easily as it opens one.

As we move from chatbots to agents that operate our devices, the gap between capability and consequence becomes dangerous. Existing benchmarks focus on whether an agent can click a button, not whether it should. In this work, I led the development of a comprehensive taxonomy and benchmark for mobile UI impacts—mapping the "blast radius" of digital actions through expert workshops, data synthesis, and systematic LLM evaluation.

Research pipeline: workshop studies produce a taxonomy, data synthesis generates high-impact action traces, crowdsourcing creates an annotated dataset, and LLMs are evaluated on classifying impact categories.
Our end-to-end pipeline: from expert workshops to taxonomy, synthesized data, and LLM evaluation.

Stress-Testing with "Unsafe" Traces. Standard datasets often capture "happy paths"—users successfully completing tasks. To truly evaluate safety, we needed the opposite. We developed a novel data synthesis protocol where we explicitly tasked crowdworkers to operate a mobile simulator with the goal of finding and executing unsafe, high-impact actions.

This "adversarial" collection strategy generated 250 high-stakes traces—covering reversible actions (e.g., toggling Wi-Fi) to irreversible consequences (e.g., deleting financial history). This dataset forces models to distinguish between benign UI interactions and those that carry enduring, real-world risks.

Full impact taxonomy table showing 10 dimensions: User Intent, Impact on UI, Impact on Self, Impact on Others, Reversibility, Rollback Effects, Idempotency, Statefulness, Execution Verification, and Impact Scope, each with specific sub-categories and examples.
The impact taxonomy: 10 dimensions that define the "blast radius" of any UI action.

Our evaluation revealed a critical gap: while SOTA models can execute actions, they frequently misjudge enduring consequences—over- or under-estimating risk based on surface cues rather than semantic understanding of reversibility, statefulness, and scope.
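
The benchmark's core task can be sketched as taxonomy classification. In this sketch, `call_llm` is a placeholder for any chat-completion client, and the dimension list is abbreviated from the full 10-dimension taxonomy:

```python
# Classify an action along taxonomy dimensions (abbreviated here).
DIMENSIONS = {
    "reversibility": ["reversible", "irreversible"],
    "statefulness": ["stateless", "state-changing"],
    "impact_scope": ["self", "others", "device", "external world"],
}

def classify_action(call_llm, action_description: str) -> dict[str, str]:
    labels = {}
    for dim, options in DIMENSIONS.items():
        prompt = (
            f"Action: {action_description}\n"
            f"Classify this action's {dim}. "
            f"Answer with exactly one of: {', '.join(options)}."
        )
        answer = call_llm(prompt).strip().lower()
        # Off-list answers count as errors rather than crashing the run.
        labels[dim] = answer if answer in options else "UNPARSEABLE"
    return labels
```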

Key Contribution: A taxonomy and benchmark that shifts agent evaluation from task-completion rates to consequence-aware safety assessment.

02
Design Systems · Automated Critique · Evaluation · UIST 2025

Can AI Critique Design Like a Pro?

The Problem

Generative AI can produce slides, but it has no concept of "good" design. Critique is often viewed as purely subjective and hard to automate.

To build an automated design engineer, we need to formalize intuition. SlideAudit is a dataset and taxonomy I developed for automated slide evaluation, grounding critique in professional design principles rather than generic feedback like "make it pop." We wanted actionable, geometric, and typographic critique rooted in Gestalt theory.

SlideAudit taxonomy table showing 27 design flaws across 5 categories: Composition & Layout, Typography, Color, Imagery, and Animation & Interaction, each mapped to Gestalt principle violations.
The SlideAudit taxonomy: 27 design flaws grounded in Gestalt principles across Composition, Typography, Color, Imagery, and Animation.
Crowdsourcing annotation interface showing a slide with content overflow issues, with taxonomy-based checkboxes for Composition, Typography, and Color categories, and bounding box drawing tools.
Our annotation interface: taxonomy-guided labeling with spatial bounding boxes.

We constructed a dataset of 2,400 slides from multiple sources, including a subset with synthetically injected flaws. Trained crowdworkers annotated each slide using our taxonomy through a custom interface that supports both categorical labeling and spatial bounding box annotation.
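
For illustration, one annotation record might be shaped like this; the fields and values are hypothetical, not the released schema:

```python
# Hypothetical shape of one crowdsourced annotation: a taxonomy label
# plus the bounding box of the offending region, in pixel coordinates.
annotation = {
    "slide_id": "deck_0042_slide_07",
    "category": "Composition & Layout",   # one of the 5 categories
    "flaw": "content_overflow",           # one of the 27 flaw types
    "bbox": {"x": 112, "y": 340, "w": 480, "h": 96},
    "source": "synthetic_injection",      # or "in_the_wild"
}
```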

Standard LLMs are mediocre design critics out of the box. But when prompted with our structured taxonomy, their ability to identify flaws and propose valid fixes improved dramatically, showing that structure is the prerequisite for style in AI systems.
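
The two conditions differ essentially in prompt structure. This sketch contrasts them with an abbreviated, invented taxonomy snippet and a placeholder `call_vlm` vision-language client:

```python
# Baseline vs. taxonomy-guided prompting for slide critique.
TAXONOMY = """\
Composition & Layout: content_overflow, misalignment, crowding
Typography: illegible_font_size, inconsistent_typefaces
Color: low_contrast, clashing_palette
"""

BASELINE_PROMPT = "Critique the design of this slide."

TAXONOMY_PROMPT = (
    "Critique the design of this slide. For each issue, name the matching "
    "flaw from the taxonomy below, locate it on the slide, and propose a "
    f"concrete fix.\n\nTaxonomy:\n{TAXONOMY}"
)

def critique(call_vlm, slide_image, guided: bool = True) -> str:
    """`call_vlm` stands in for any vision-language model client."""
    prompt = TAXONOMY_PROMPT if guided else BASELINE_PROMPT
    return call_vlm(prompt, image=slide_image)
```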

Comparison bar charts showing Baseline vs Taxonomy-guided performance. Taxonomy prompting produces more accurate flaw detection and more implementable remediation solutions.
Taxonomy-guided prompting produces more accurate flaw detection (top) and more implementable design fixes (bottom).

Key Contribution: A large-scale dataset and taxonomy that transforms subjective design critique into a structured, automatable evaluation pipeline.

03
RAG · Multimodal · Personal Memory · CHI 2025

Solving the "Goldfish Memory" Problem in AI

The Problem

Standard RAG (Retrieval-Augmented Generation) fails at personal questions because it treats memories as isolated documents, not a connected life.

If you ask an AI "What social events did I attend during the conference?", simple retrieval fails. The AI might find a photo of a badge, or a calendar invite, but it doesn't understand that those two things constitute an "event." OmniQuery is a system we built to solve this: it enables AI to answer complex personal questions by structuring fragmented screenshots, photos, and documents into coherent semantic memories before query time.

OmniQuery System Overview showing the pipeline from personal question to answer generation.
From raw multimodal capture to semantic answer generation.

Instead of retrieving raw data, OmniQuery first processes the user's timeline using a sliding window approach to identify "Composite Contexts"—inferring that a flight ticket + a hotel booking + a conference badge equals a "Trip."

Sliding Window Logic diagram showing how captured memories are stitched into atomic and composite contexts.
Stitching fragmented screenshots into coherent semantic memories via sliding window aggregation.
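
Under simplifying assumptions about the memory format, the aggregation step might look like the sketch below; `infer_composite` stands in for the LLM call that decides whether a window of atomic memories forms one event, and the fixed two-day window is an invented parameter:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    """An atomic captured item: a photo, screenshot, or document."""
    timestamp: datetime
    caption: str   # e.g. "boarding pass LAX->FRA", "conference badge"

def composite_contexts(memories: list[Memory], infer_composite,
                       window: timedelta = timedelta(days=2)):
    """Slide a time window over the timeline and let the model stitch
    co-occurring atomic memories into higher-level events. The
    `infer_composite` callback returns an event label such as
    "trip to Germany", or None if the group is unrelated."""
    memories = sorted(memories, key=lambda m: m.timestamp)
    contexts, i = [], 0
    while i < len(memories):
        horizon = memories[i].timestamp + window
        group = [m for m in memories[i:] if m.timestamp <= horizon]
        label = infer_composite([m.caption for m in group])
        if label is not None:
            contexts.append((label, group))
        i += len(group)   # advance past the window just processed
    return contexts
```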

By structuring episodic data into semantic representations before query time, OmniQuery enables the AI to answer complex questions that span multiple memories. While baseline systems see pixels and isolated metadata, our system constructs a knowledge graph of life events.
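
Query time then becomes retrieval over both levels of representation. A rough sketch, with `retrieve` and `call_llm` as placeholders for an embedding index and a language model:

```python
def answer(query: str, retrieve, call_llm) -> str:
    """Retrieve over composite contexts and atomic memories, then
    generate an answer grounded in the retrieved evidence."""
    evidence = retrieve(query, kinds=("composite", "atomic"), k=8)
    notes = "\n".join(f"- {e}" for e in evidence)
    return call_llm(
        f"Using only these personal memories:\n{notes}\n\n"
        f"Answer the question: {query}"
    )
```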

Comparison between OmniQuery and Baseline RAG showing semantic understanding vs literal captioning.
OmniQuery extracts semantic intent ("German HCI Event") where baselines only see surface features ("Purple lights").

Key Contribution: A contextual augmentation pipeline that structures fragmented multimodal data into semantic representations, enabling personal question answering far beyond standard retrieval.

Ongoing Research

Latent Design Value Learning

Current generative models have mastered aesthetics but struggle with logic. They project a convincing illusion of competence while operating without genuine understanding or intent. They lack Design Intent: the awareness that a layout is a strict hierarchy of rules, not just a collection of pixels. When an AI edits a screen, it often "breaks" the system because it cannot distinguish between rigid constraints (like brand identity) and flexible ones.

My work aims to formalize design as a Constraint Satisfaction Problem. Instead of relying on static examples, we are developing a method to learn "Design DNA" by observing human refinement. By analyzing how experts repair broken layouts—specifically, which constraints they strictly enforce versus which they sacrifice—we can train models to internalize the underlying values of a design system. This enables a "reasoning-before-generation" approach, ensuring that generated interfaces are not just visually plausible, but functionally and systemically correct.
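
As a toy formalization (assuming layouts expose programmatically checkable constraints), hard constraints gate validity while soft-constraint weights, fit from expert repair traces, encode the "Design DNA":

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    check: Callable[[dict], bool]   # layout -> satisfied?
    hard: bool                      # rigid (brand identity) vs flexible
    weight: float = 1.0             # learned from expert repair traces

def score(layout: dict, constraints: list[Constraint]) -> float:
    """Reject any layout violating a hard constraint; otherwise sum
    the weights of satisfied soft constraints. Raising a weight each
    time experts preserve that constraint during a repair, and lowering
    it when they sacrifice it, is one way to fit these values."""
    total = 0.0
    for c in constraints:
        ok = c.check(layout)
        if c.hard and not ok:
            return float("-inf")    # hard violation: invalid layout
        if not c.hard and ok:
            total += c.weight
    return total

# Illustrative constraints over a layout dict such as
# {"logo_color": "#E10098", "gutter_px": 12}; names are invented.
constraints = [
    Constraint("brand_logo_color",
               lambda l: l.get("logo_color") == "#E10098", hard=True),
    Constraint("min_gutter",
               lambda l: l.get("gutter_px", 0) >= 8, hard=False, weight=0.7),
]
```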