About me

I am a Ph.D. candidate in Computer Science at the University of Texas at Dallas, advised by Prof. Yapeng Tian. My research focuses on multimodal AI, with an emphasis on audio-visual learning, video understanding, and LLM-augmented multimodal systems. I am particularly interested in building robust and agentic AI systems that operate reliably in real-world settings.

Prior to UTD, I received a Bachelor of Science in Mathematics from Sichuan University.

News

Publications

Building robust and agentic multimodal AI systems for real-world understanding and assistance.

Theme A centers on multimodal perception that remains stable under ambiguity, spurious cues, and distribution shift. This line spans robust audio-visual segmentation, open-set adaptation, and benchmark-driven reliability analysis.

Theme B focuses on agentic multimodal reasoning: systems that do not just encode video, but iteratively reason over it using tools, skills, and verification. The goal is structured multimodal decision-making beyond one-shot inference.

Theme C grounds the agenda in egocentric and real-world AI applications, especially online gaze understanding, assistive computing, and multimodal systems that must operate causally and efficiently in deployment settings.

Projects

Resume

Education

  1. University of Texas at Dallas

    2023 — Present

Artificial Intelligence, Machine Learning, Data Structures and Algorithms ...

  2. Sichuan University

    2018 — 2023

    Probability Theory, General Topology, Functional Analysis ...

Experience

  1. Research Assistant

    2024 — Present

Working on LLM and HAYSTAC projects.

  2. Teaching Assistant

    2023 — 2024

Operating Systems & Discrete Math

My skills

  • Python - data preprocessing
    80%
  • PyTorch - deep learning architectures
    70%
  • Linux
    90%
