Biography
Welcome! I am a third-year Ph.D. student at the University of California, Merced, advised by the amazing Ming-Hsuan Yang. I am also a research intern at Snap, where I am privileged to work with Aliaksandr Siarohin, Sergey Tulyakov, Jun-Yan Zhu, and Kfir Aberman. My research aims at building advanced video generation models with groundbreaking applications. Previously, I received my M.S. and B.S. from National Taiwan University. If you would like to learn more about me, here is my [CV] (updated in Jan 2025), or reach out to me at tsaishienchen [at] gmail.com!
I am honored to receive the Graduate Student Opportunity Program Fellowship.
Selected Publications
Check the full publications list in my [CV].
Multi-subject Open-set Personalization in Video Generation
We introduce Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and backgrounds, eliminating the need for time-consuming test-time optimization.
arXiv preprint, 2025
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
We introduce Panda-70M, a large-scale video dataset with high-quality automatic caption annotations.
CVPR, 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
We introduce Snap Video, a transformer-based text-to-video model that, for the first time, allows us to efficiently train a T2V model with billions of parameters.
CVPR, 2024
[Highlight, acceptance rate: 2.8%]
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
arXiv preprint, 2023
Incremental False Negative Detection for Contrastive Learning
We propose IFND, which incrementally detects more reliable false negatives and explicitly removes them over the course of contrastive learning, as the embedding space becomes more semantically structured.
ICLR, 2022
Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network
We propose SPAN to predict the spatial attention map for each vehicle view given only image-level labels for training, and introduce a distance metric emphasizing the differences between co-occurring vehicle views.
ECCV, 2020
[Oral, acceptance rate: 2.1%]
Viewpoint-Aware Channel-Wise Attentive Network for Vehicle Re-Identification
We propose VCAM, which enables our framework to reweigh the importance of each feature map channel-wise according to the viewpoint of the input vehicle image. In addition, we explore the interpretability of how VCAM actually improves vehicle re-identification performance.
CVPR Workshops, 2020