Me
Tsai-Shien Chen
Ph.D. Student at UC Merced
Captured at the beautiful Santa Monica Beach, where I spent the summers of 2023 and 2024. Sending prayers to everyone impacted by the LA wildfires. 🙏

Biography

Welcome! I am a third-year Ph.D. student at the University of California, Merced, advised by the amazing Ming-Hsuan Yang. I am also a research intern at Snap, where I am privileged to work with Aliaksandr Siarohin, Sergey Tulyakov, Jun-Yan Zhu, and Kfir Aberman. My research focuses on building advanced video generation models with groundbreaking applications. Previously, I received my M.S. and B.S. from National Taiwan University. If you would like to learn more about me, here is my [CV] (updated in Jan 2025), or reach out to me at tsaishienchen [at] gmail.com!

I am honored to receive the Graduate Student Opportunity Program Fellowship.

May 2023 - Now
Research Intern @ Snap
Aug. 2022 - Now
Ph.D. Student @ UC Merced
Sep. 2019 - March 2022
Master's Student @ NTU

Selected Publications

Check the full publication list in my [CV]
Multi-subject Open-set Personalization in Video Generation
[ website ] [ arXiv ]
We introduce Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization.
arXiv preprint, 2025
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
[ website ] [ arXiv ] [ code ] [ video ] [ slides ] [ poster ]
We introduce Panda-70M, a large-scale video dataset with high-quality automatic caption annotations.
CVPR, 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
[ website ] [ arXiv ] [ video ]
We introduce Snap Video, a transformer-based text-to-video model, which allows us to efficiently train a T2V model with billions of parameters for the first time.
CVPR, 2024 [Highlight, acceptance rate: 2.8%]
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
[ website ] [ arXiv ]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
arXiv preprint, 2023
Incremental False Negative Detection for Contrastive Learning
[ OpenReview ] [ arXiv ] [ slides ] [ poster ]
We propose IFND, which incrementally detects more reliable false negatives and explicitly removes them as contrastive learning progresses and the embedding space becomes more semantically structured.
ICLR, 2022
Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network
[ website ] [ arXiv ] [ code ] [ video ] [ slides ]
We propose SPAN to predict the spatial attention map for each vehicle view given only image-level labels for training, and introduce a distance metric that emphasizes the differences in co-occurring vehicle views.
ECCV, 2020 [Oral, acceptance rate: 2.1%]
Viewpoint-Aware Channel-Wise Attentive Network for Vehicle Re-Identification
[ arXiv ] [ video ] [ slides ]
We propose VCAM, which enables our framework to channel-wisely reweigh the importance of each feature map according to the viewpoint of the input vehicle image. In addition, we explore the interpretability of how VCAM actually improves vehicle re-identification performance.
CVPR Workshops, 2020
