Generative AI-Based Multimedia Content Creation Automation System Development Framework

Published: Feb 28, 2026

Abstract:

Background: Global demand for multimedia content is growing rapidly as digital platforms advance. However, the traditional content creation process, which spans ideation, writing, design, and post-production, still tends to be slow, expensive, and difficult to scale. Although Generative AI can automatically create text, images, audio, and video, the models are typically used in isolation, which fragments the workflow and provides no unified mechanism for keeping quality consistent.
Aims: This research analyzes the progress of Generative AI and designs a systematic framework that combines multimodal models in a single automated process for end-to-end multimedia content production. The framework is intended to improve the effectiveness, consistency, and scalability of content creation.
Methods: This study conducts a comprehensive literature review of 20 indexed scientific articles (2022–2025) covering generative models, multimodal large language models, workflow automation, model evaluation, and cross-modal integration. The literature was analyzed through thematic synthesis to identify key trends, research gaps, integration challenges, and the need for generative model orchestration systems.
Results: The results show that although Generative AI models are undergoing rapid development, including diffusion models, multimodal LLMs, and knowledge-enhanced systems, there is no comprehensive framework that governs task chaining, quality control, and brand consistency in content creation.
Conclusion: Generative AI offers significant opportunities for automating content creation, but realizing them requires structured integration through a comprehensive framework. The proposed framework offers an integrative structure for text, image, audio, and video models, combining automation, style consistency, and quality control in a single workflow.

Keywords: AI, Digital, Framework, Generative, Multimedia

Authors:
1. Annisa Rohiimah Sufriyani
2. Athiya Rahma Aulia
3. Atiqah Najla Fadhilah Rachman
4. Royana Dwi Rohmah
5. Ghina Salwa Salsabilla
6. Mohammad Qais Rezvani

Copyright (c) 2026 Annisa Rohiimah Sufriyani, Athiya Rahma Aulia, Atiqah Najla Fadhilah Rachman, Royana Dwi Rohmah, Ghina Salwa Salsabilla, Mohammad Qais Rezvani
