Abstract:
Background: Global demand for multimedia content is growing rapidly as digital platforms advance. However, the traditional content creation process, spanning ideation, writing, design, and post-production, remains slow, expensive, and difficult to scale. Generative AI can now produce text, images, audio, and video automatically, but using these models in isolation fragments the workflow and provides no unified mechanism for quality consistency.
Aims: This research analyzes the progress of Generative AI and designs a systematic framework that combines multi-modal models into one automated process for end-to-end multimedia content production. The framework aims to improve the effectiveness, consistency, and scalability of content creation.
Methods: This study adopts a comprehensive literature review of 20 indexed scientific articles (2022–2025) covering generative models, multimodal large language models, workflow automation, model evaluation, and cross-modal integration. The literature was analyzed through thematic synthesis to identify key trends, research gaps, integration challenges, and the need for generative model orchestration systems.
Results: The results show that although Generative AI models, including diffusion models, multimodal LLMs, and knowledge-enhanced systems, are developing rapidly, no comprehensive framework yet governs task chaining, quality control, and brand consistency in content creation.
Conclusion: This study concludes that Generative AI offers significant opportunities for automating content creation, but it requires structured integration through a comprehensive framework. The proposed framework provides an integrative structure for text, image, audio, and video models, combining automation, style consistency, and quality control in a single workflow.
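To make the orchestration idea in the conclusion concrete, the sketch below shows one plausible shape for such a framework: stages (wrappers around text, image, audio, or video models) chained in sequence, each guarded by a quality gate, with shared style metadata carried along for consistency. This is a minimal illustration, not part of the reviewed study; all names (`Pipeline`, `Asset`, the toy stages) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical asset passed between stages: the payload plus shared
# style metadata, so every stage can enforce the same brand voice.
@dataclass
class Asset:
    payload: str
    style: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

# A stage is any callable Asset -> Asset (e.g. a wrapper around a
# text, image, or audio model); a gate is a pass/fail quality check.
Stage = Callable[[Asset], Asset]
Gate = Callable[[Asset], bool]

class Pipeline:
    """Chains generative stages and rejects output that fails a gate."""

    def __init__(self) -> None:
        self.steps: list[tuple[str, Stage, Gate]] = []

    def add(self, name: str, stage: Stage, gate: Gate = lambda a: True):
        self.steps.append((name, stage, gate))
        return self  # fluent chaining

    def run(self, asset: Asset, max_retries: int = 2) -> Asset:
        for name, stage, gate in self.steps:
            for _attempt in range(max_retries + 1):
                candidate = stage(asset)
                if gate(candidate):          # quality control per stage
                    candidate.history.append(name)
                    asset = candidate
                    break
            else:
                raise RuntimeError(f"stage {name!r} failed quality gate")
        return asset

# Toy stages standing in for real model calls.
def draft_text(a: Asset) -> Asset:
    return Asset(f"script about {a.payload}", a.style, a.history)

def storyboard(a: Asset) -> Asset:
    return Asset(f"storyboard for: {a.payload}", a.style, a.history)

pipe = (Pipeline()
        .add("text", draft_text, gate=lambda a: len(a.payload) > 10)
        .add("image", storyboard))
result = pipe.run(Asset("solar energy", style={"tone": "formal"}))
print(result.history)  # ['text', 'image']
```

The retry loop around each gate is one simple way to express the review's observation that quality control must be built into the chain itself rather than applied after the fact.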
Keywords: AI, Digital, Framework, Generative, Multimedia
Copyright (c) 2026 Annisa Rohiimah Sufriyani, Athiya Rahma Aulia, Atiqah Najla Fadhilah Rachman, Royana Dwi Rohmah, Ghina Salwa Salsabilla, Mohammad Qais Rezvani

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
References
Algethami, N., Iqbal, T., & Ullah, I. (2025). Generative AI for biomedical video synthesis: a review. Artificial Intelligence Review, 58(392), 1–50. https://doi.org/10.1007/s10462-025-11394-5
Asyhari, M. F., Dimas, F., Bakar, A. M. A., & Bastian, A. (2025). Pencarian cerdas antar-moda: Evolusi teknologi video-text retrieval [Intelligent cross-modal search: The evolution of video-text retrieval technology]. Jurnal Informatika Dan Teknik Elektro Terapan, 13(3), 69–79. https://doi.org/10.23960/jitet.v13i3.6607
Chen, Z., Zhang, Y., Fang, Y., Geng, Y., Guo, L., Chen, J., Liu, X., Pan, J. Z., Zhang, N., Chen, H., & Zhang, W. (2025). Knowledge Graphs for Multi-modal Learning: Survey and Perspective. Information Fusion, 121, 103124. https://doi.org/10.1016/j.inffus.2025.103124
Dong, A., Wang, L., Liu, J., Lv, G., Zhao, G., & Cheng, J. (2024). MFIFusion: An infrared and visible image enhanced fusion network based on multi-level feature injection. Pattern Recognition, 152, 110445. https://doi.org/10.1016/j.patcog.2024.110445
Dongoran, I. M., Azhar, I. N., Anto, J., & Hakim, D. L. (2022). The Effect of Interactive Multimedia on Student Behavior Against Covid-19 in Vocational High Schools. Education and Humanities Research, 651(Icieve 2021), 130–133. https://doi.org/10.2991/assehr.k.220305.027
Gou, J., Xie, N., Liu, J., Yu, B., Ou, W., G, Z. Y., & Chen, W. (2024). Hierarchical graph augmented stacked autoencoders for multi-view representation learning. Information Fusion, 102, 102068. https://doi.org/10.1016/j.inffus.2023.102068
Hasanuddin, M., & Nurfransiska, F. (2026). Pengaturan artificial intelligence (AI) dalam perspektif hukum Indonesia: Analisis normatif atas tantangan, implikasi, dan model regulasi ideal [Regulating artificial intelligence (AI) in the perspective of Indonesian law: A normative analysis of challenges, implications, and an ideal regulatory model]. Judge: Jurnal Hukum, 6(6), 1890–1897. https://doi.org/10.54209/judge.v6i06.1632
Luo, Y., Chen, E., & Yang, S.-H. (2025). Generative AI in Engineering Education: A Survey of Student and Instructor Usage and Attitudes. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 69(1), 272–277. https://doi.org/10.1177/10711813251358792
Lv, Z. (2023). Generative artificial intelligence in the metaverse era. Cognitive Robotics, 3, 208–217. https://doi.org/10.1016/j.cogr.2023.06.001
Lymperaiou, M., & Stamou, G. (2024). A survey on knowledge-enhanced multimodal learning. Artificial Intelligence Review, 57(10). https://doi.org/10.1007/s10462-024-10825-z
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2025). Large language models: A survey. arXiv. https://doi.org/10.48550/arXiv.2402.06196
Nugroho, A. Y. (2025). Deep Learning Algorithms in the Development of Generative AI Models for Automated Content Creation. Mutiara: Jurnal Penelitian Dan Karya Ilmiah, 3(5), 111–122. https://doi.org/10.59059/mutiara.v3i5.2804
Rahmani, H. A., & Liu, J. (2026). AI-Generated Content (AIGC) for Various Data Modalities: A Survey. ACM Computing Surveys, 57(9), 1–67. https://doi.org/10.1145/3728633
Sengar, S. S., Hasan, A. B., Kumar, S., & Mallory, F. (2025). Generative artificial intelligence: a systematic review and applications. Multimedia Tools and Applications, 84, 23661–23700. https://doi.org/10.1007/s11042-024-20016-1
Sun, L., Lian, Z., Liu, B., & Tao, J. (2024). HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition. Information Fusion, 108, 102382. https://doi.org/10.1016/j.inffus.2024.102382
Tang, X., Rohaida, S., Mohamed, B., & Li, Q. (2023). Multimedia use and its impact on the effectiveness of educators: a technology acceptance model perspective. Humanities and Social Sciences Communications, 10(1), 923. https://doi.org/10.1057/s41599-023-02458-4
Tiwari, A., & Misra, M. (2018). Analysis of operative factors and practices in social CRM. International Journal of Digital Enterprise Technology, 1(1), 135–176. https://doi.org/10.1504/IJDET.2018.092639
Yao, X. (2024). Research on Multimodal English Teaching Methods and Practices Leading to Intelligent Generation. Applied Mathematics and Nonlinear Sciences, 9(1), 1–18. https://doi.org/10.2478/amns-2024-1654
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2024). A survey on multimodal large language models. National Science Review, 11(12), nwae403. https://doi.org/10.1093/nsr/nwae403
Yuan, Y., Li, Z., & Zhao, B. (2026). A Survey of Multimodal Learning: Methods, Applications, and Future Directions. ACM Computing Surveys, 57(7), 1–34. https://doi.org/10.1145/3713070
Yuzyk, O., Honcharuk, V., Pelekh, Y., Bilanych, L., Sirenko, P., Voitovych, I., Roienk, L., Bilanych, H., Makukh, D., Zidens, J., & Yuzyk, M. (2025). Research on Generative Artificial Intelligence Technologies in Education: Opportunities, Challenges, and Ethical Aspects. BRAIN. Broad Research in Artificial Intelligence and Neuroscience, 16(1), 139–151. https://doi.org/10.70594/brain/16.S1/12
Zhang, Z., Li, Z., Wei, K., Pan, S., & Deng, C. (2022). A survey on multimodal-guided visual content synthesis. Neurocomputing, 497, 110–128. https://doi.org/10.1016/j.neucom.2022.04.126
Zhao, M., Wang, W., Zhang, R., Jia, H., & Chen, Q. (2025). TIA2V: Video generation conditioned on triple modalities of text–image–audio. Expert Systems with Applications, 268, 126278. https://doi.org/10.1016/j.eswa.2024.126278