The use of generative AI is growing rapidly, with adoption across numerous industries and an expanding user base worldwide. As generative AI models become larger and more complex, it is crucial to minimize their environmental impact. This requires continuous work to reduce energy usage and increase efficiency, maximizing the benefit obtained from resources while minimizing the total resources needed. Building on guidance for making deep learning workloads more sustainable on AWS, this article provides recommendations specific to generative AI workloads. In particular, it offers practical best practices for different customization scenarios: training models from scratch, fine-tuning with full or parameter-efficient techniques, Retrieval Augmented Generation (RAG), and prompt engineering. While the post focuses mainly on large language models (LLMs), most recommendations likely extend to other foundation models as well.
Generative AI Lifecycle
The generative AI lifecycle covers the full process from framing the initial problem to ongoing monitoring of the deployed model. The key phases are defining the task, training or customizing the model, deploying it for inference, and monitoring its performance post-deployment. Following this lifecycle enables you to create, optimize, and maintain accurate generative AI systems.
Generative AI Problem Framing
Clearly define the problem you want to solve (make sure it is a problem that actually needs generative AI) and frame it appropriately for a foundation model. Determine the task, data requirements, evaluation metrics, and performance criteria. When embarking on a generative AI project, align the use of this technology with sustainability goals from the start. Carefully consider the trade-offs between a generative AI solution and less resource-intensive traditional approaches during project scoping. Select energy-efficient hardware, optimized infrastructure, techniques such as using AWS managed services like Amazon Bedrock and Amazon SageMaker, and approaches that minimize the resources needed for training and inference.
Selecting the right foundation model is key to minimizing resource usage when tailoring generative AI systems. First, assess different large language models' capabilities and limitations using playgrounds; this can reduce customization requirements. Consider domain-specific or multilingual models that fit the use case. Begin with smaller model sizes and context windows, since larger ones need more energy and resources for inference. Specialized compact models can deliver performance comparable to much larger models on certain tasks while requiring fewer training and inference resources. Thoroughly evaluating candidates and starting with small models can substantially lower the compute, data, and energy needed to responsibly customize generative AI.
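To make this concrete, the following is a minimal sketch of probing a smaller foundation model through the Amazon Bedrock runtime API before committing to a larger one. The model ID, region, and request body are illustrative and depend on the models enabled in your account and the provider's payload format.

```python
# Minimal sketch: evaluate a compact foundation model via Amazon Bedrock before
# defaulting to a larger one. Model ID and payload shape are illustrative and
# vary by model provider; check the Bedrock documentation for your chosen model.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="amazon.titan-text-lite-v1",  # a compact model, assumed enabled in the account
    body=json.dumps({
        "inputText": "Classify the sentiment of: 'The delivery was late but the product is great.'",
        "textGenerationConfig": {"maxTokenCount": 64, "temperature": 0.2},
    }),
)

result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```

If the compact model already meets the task's quality bar in such a trial, the larger alternatives and their additional inference cost may not be needed at all.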
Model Training & Customization
When training a large language model from scratch, it is crucial to implement strategies that maximize efficiency and minimize resource usage. Use energy-optimized hardware like AWS Trainium instances, which are specifically designed for high-performance, efficient deep learning training. Enhance training data quality and minimize data needs through scalable curation with SageMaker. By training on a diverse, comprehensive, yet curated dataset, models can achieve high accuracy while reducing compute and storage requirements. Employ distributed training techniques such as data, pipeline, and tensor parallelism to run computations in parallel across multiple accelerators or instances. This maximizes device utilization by splitting training batches into efficient microbatches that keep all devices actively engaged. Following these best practices for silicon, data, and distributed training allows generative AI models to be trained from scratch with optimized use of compute resources and energy.
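As a rough illustration, the following sketch launches a distributed training job on Trainium-based instances with the SageMaker Python SDK. The entry point, role ARN, instance count, framework versions, S3 path, and hyperparameters are all placeholders; adapt them to your own training script and account.

```python
# Minimal sketch (script name, role ARN, versions, and hyperparameters are
# illustrative) of launching distributed training on AWS Trainium (ml.trn1)
# instances using the SageMaker Python SDK.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    source_dir="src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.trn1.32xlarge",  # Trainium-based, energy-optimized training hardware
    instance_count=2,
    framework_version="1.13",
    py_version="py39",
    distribution={"torch_distributed": {"enabled": True}},  # data-parallel launch across devices
    hyperparameters={"epochs": 3, "per_device_batch_size": 8},
)

# Train on a curated dataset stored in S3 (illustrative URI).
estimator.fit({"train": "s3://my-bucket/curated-training-data/"})
```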
Customize the model for your specific use case through techniques like fine-tuning and prompt engineering, and make sure you define the right customization strategy. There are several strategies to enhance the capabilities of your model, ranging from prompt engineering to full fine-tuning. Choose the most suitable strategy based on your specific needs while also considering the differences in resources required for each. For instance, fine-tuning might achieve higher accuracy than prompt engineering, but it consumes more resources and energy in the training phase. Make trade-offs: by opting for a customization approach that prioritizes acceptable performance over optimal performance, you can reduce the resources used by your models. The following figure summarizes the environmental impact of LLM customization strategies.
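At the lower-resource end of that spectrum sits parameter-efficient fine-tuning. The following is a minimal sketch of LoRA with the Hugging Face peft library; the base model and LoRA hyperparameters are illustrative, and only the small adapter matrices are trained rather than the full model.

```python
# Minimal sketch of parameter-efficient fine-tuning (LoRA) with the Hugging Face
# peft library; base model and LoRA hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small base model for illustration

lora_config = LoraConfig(
    r=8,                        # low-rank dimension: small adapters, few trainable weights
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
# ...train with your usual training loop; only the adapter weights are updated.
```

Because only the adapters are updated, training memory, compute, and energy are substantially lower than full fine-tuning, which is the trade-off the paragraph above describes.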
Model Deployment & Inference
Deploy the trained generative model to production through APIs, applications, and other interfaces, making it available for inference on real-world inputs. To deploy your model and workloads, choose AWS Regions near Amazon renewable energy projects and Regions where the grid has a lower published carbon intensity.
To optimize generative AI model inference and deployment for efficiency, use deep learning containers and frameworks such as SageMaker, DeepSpeed, and Hugging Face to implement techniques like pruning and quantization that reduce model size and memory usage. Carefully set key inference parameters like temperature and top_k to obtain the most relevant outputs while minimizing prompt-tuning iterations and energy consumption. Deploy models on Inf2 instances, which offer highly energy-efficient inference for large models, or use batch transform to avoid maintaining continuous infrastructure when real-time responses are not needed. Define service level agreements (SLAs) aligned with sustainability goals, trading off some latency for decreased resource usage through asynchronous processing, adjusted availability zone usage, and relaxed response times.
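The sketch below combines two of these levers: loading a model in 8-bit quantized form to cut memory usage, and constraining generation parameters so each request does less work. It assumes the Hugging Face transformers and bitsandbytes libraries; the model ID and parameter values are illustrative.

```python
# Minimal sketch (model ID and parameter values are illustrative) of 8-bit
# quantized inference with conservative sampling settings using Hugging Face
# transformers and bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b-instruct"  # illustrative open model

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # reduces memory footprint at load time

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer(
    "Summarize the benefits of energy-efficient inference.",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,  # cap output length to limit compute per request
    temperature=0.2,     # low temperature for focused, relevant outputs
    top_k=40,            # restrict sampling to the most likely tokens
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```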
Continuous Monitoring
To track the impact of optimizations for generative AI workloads over time, implement a process of continuous monitoring and improvement. The goal is to fully utilize provisioned resources while minimizing the total resources needed to complete the work. Collect metrics on cloud resource utilization from tools like CloudWatch, the NVIDIA System Management Interface, and SageMaker Profiler. Monitor key metrics such as CPU, GPU, memory, and disk utilization, and combine these system metrics with business metrics to estimate carbon emissions. Consistently gathering metrics on resource and model performance lets you quantify the sustainability improvements from optimizations like efficient hardware, distributed training, model compression, and optimized inference. Continued monitoring also identifies areas requiring further optimization, enabling incremental progress toward the most efficient use of resources. This operational process is essential for aligning generative AI with sustainability goals.
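As one example of such a metrics pull, the following sketch reads hourly GPU utilization for a SageMaker endpoint from CloudWatch; the endpoint name, region, and time window are placeholders.

```python
# Minimal sketch (endpoint name, region, and time window are illustrative) of
# retrieving GPU utilization for a SageMaker endpoint from CloudWatch to check
# how fully provisioned inference resources are being used.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,            # hourly aggregation
    Statistics=["Average"],
)

# Print hourly average GPU utilization for the last 24 hours.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```

Sustained low utilization in a report like this is a signal to right-size instances, batch requests, or move infrequent workloads to batch or asynchronous processing.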
Conclusion
With generative AI models increasing in size, it is crucial that we consider the environmental footprint of these workloads. This post offered guidance on optimizing the compute, storage, and networking resources needed to run generative AI systems on AWS in the most sustainable way. Because generative AI is a rapidly evolving field, staying current on the latest courses, research, and tools can reveal new methods for making these workloads greener. By continuously educating ourselves and applying emerging best practices, we can minimize the environmental impact as we develop and deploy ever-larger generative models. The key is to keep sustainability at the forefront and leverage new knowledge to find better ways of training and running these AI systems responsibly. Please refer to this blog for a more detailed understanding of these steps and how to approach designing your generative AI workloads for energy efficiency.
Article written by Ishneet Dua and Parth Patel of Amazon Web Services