Background: Global demand for multimedia content is growing rapidly as digital platforms advance. However, the traditional content creation process, which spans ideation, writing, design, and post-production, remains slow, expensive, and difficult to scale. Although Generative AI can now automatically create text, images, audio, and video, these tools are typically used in isolation, which fragments the workflow and leaves no unified mechanism for keeping quality consistent.
Aims: This research aims to analyze the progress of Generative AI and to design a systematic framework that combines multimodal models in a single automated process for end-to-end multimedia content production. The framework is intended to improve the effectiveness, consistency, and scalability of content creation.
Methods: This study conducts a comprehensive literature review of 20 indexed scientific articles (2022–2025) that discuss generative models, multimodal large language models, workflow automation, model evaluation, and cross-modal integration. The literature was analyzed through thematic synthesis to identify key trends, research gaps, integration challenges, and the need for generative model orchestration systems.
Results: The results show that although Generative AI models, including diffusion models, multimodal LLMs, and knowledge-enhanced systems, are developing rapidly, the reviewed literature offers no comprehensive framework that governs task chaining, quality control, and brand consistency in content creation.
Conclusion: This study shows that Generative AI offers significant opportunities for automating content creation, but realizing them requires structured integration through a comprehensive framework. The proposed framework provides an integrative structure for text, image, audio, and video models, combining automation, style consistency, and quality control in a single workflow.