This work makes three distinct but complementary contributions:
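The existing problem of “inaccurate text rendering” is mainly attributed to the limitations of the text encoder. For example, the original CLIP text encoder was tailored for broad, concept-level vision-language semantic alignment, whereas the T5/ByT5 text encoders focus on deep language understanding.
However, although recent studies show that T5/ByT5 text encoders benefit visual text rendering, neither has been explicitly fine-tuned to interpret glyph images. The lack of a customized text encoder design can lead to inaccurate text rendering across a variety of applications.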
The corresponding glyph description: {Text “The way you create a better future is by studying the past.” in [font-color-127], [font-type-234]. Text “Happy Graduation Amber” in [font-color-98] [font-type-231]}.
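To make the format of the glyph description concrete, here is a minimal sketch of how such a string could be assembled from a list of text boxes. The `TextBox` class, its field names, and the `glyph_description` helper are hypothetical names introduced for illustration only and are not taken from the original work. The sketch also assumes that every entry separates the color token and the font token with a comma, and that entries are joined with a period and a space.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """Hypothetical container for one rendered text box."""
    text: str       # the string that should appear in the image
    color_id: int   # index used in the [font-color-N] token (assumed)
    font_id: int    # index used in the [font-type-N] token (assumed)


def glyph_description(boxes: List[TextBox]) -> str:
    """Join one 'Text “...” in [font-color-N], [font-type-M]' entry per box."""
    parts = [
        f"Text “{b.text}” in [font-color-{b.color_id}], [font-type-{b.font_id}]"
        for b in boxes
    ]
    # Assumption: entries are separated by a period followed by a space.
    return ". ".join(parts)


boxes = [
    TextBox("The way you create a better future is by studying the past.", 127, 234),
    TextBox("Happy Graduation Amber", 98, 231),
]
print(glyph_description(boxes))
```

Keeping the color and font as discrete bracketed tokens, rather than free-form words, is what lets a text encoder treat them as dedicated vocabulary entries. How those tokens are actually registered with the tokenizer is outside the scope of this sketch.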
Character-aware models improve visual text rendering, https://aclanthology.org/2023.acl-long.900/
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.
Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, et al. COLE: A hierarchical generation framework for graphic design. arXiv preprint arXiv:2311.16974, 2023.