Large language models (LLMs) have become essential tools for a wide range of applications, from translation and text classification to customer service. However, using these models typically means sending requests to centralized servers, which is costly and consumes significant energy and time. The demand for faster, cheaper, and more energy-efficient ways to deploy LLMs has prompted researchers to explore approaches that reduce the computational load without sacrificing performance.
A breakthrough in this area comes from researchers at Princeton University and Stanford Engineering, who have developed a new algorithm to compress LLMs. The goal of their research, published on the arXiv preprint server, is to make LLMs more accessible for use on consumer-grade devices, like smartphones and laptops, by reducing their size and computational requirements. The new technique could significantly improve privacy, lower energy consumption, and reduce operational costs, making LLMs more practical for everyday applications.
The algorithm, known as CALDERA (Calibration Aware Low precision DEcomposition with low Rank Adaptation), is designed to trim redundancies in an LLM's data and reduce the precision of its layers. CALDERA combines two key ideas in a novel way: "low-precision" and "low-rank." Low-precision representation uses fewer bits to store and process each number, which shrinks memory requirements and improves speed and energy efficiency. Low-rank approximation, on the other hand, exploits redundancy in the model's weight matrices, the large grids of numbers that encode learned patterns in language, by representing them with far fewer parameters. By combining these two properties, CALDERA achieves far greater compression than traditional methods that rely on either technique individually.
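To make the combination concrete, the toy sketch below approximates a weight matrix as a low-precision (quantized) matrix plus a small low-rank correction. This is only an illustration of the general "low-rank plus low-precision" idea, not the authors' CALDERA algorithm, which additionally uses calibration data and more sophisticated quantization; the function names, bit width, and rank chosen here are illustrative assumptions.

```python
# Toy sketch of "low-rank + low-precision" compression of a weight matrix W,
# approximating W as Q + L @ R. Illustrative only; NOT the CALDERA algorithm.
import numpy as np

def uniform_quantize(x, num_bits=4):
    """Uniformly quantize x to 2**num_bits levels over its own value range."""
    levels = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)
    return q * scale + lo  # de-quantized values, as the compressed model would use them

def low_rank_plus_low_precision(W, rank=16, num_bits=4):
    """Approximate W with a low-precision matrix Q plus a rank-`rank` correction L @ R."""
    Q = uniform_quantize(W, num_bits)            # low-precision backbone
    residual = W - Q                             # information lost to quantization
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] * s[:rank]                   # low-rank factors capture the dominant
    R = Vt[:rank, :]                             # structure of the residual
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))              # stand-in for one layer's weights
Q, L, R = low_rank_plus_low_precision(W, rank=32, num_bits=4)
approx = Q + L @ R
print("relative error:", np.linalg.norm(W - approx) / np.linalg.norm(W))
```

In this sketch the quantized matrix Q stores each entry in a handful of bits, while the thin factors L and R add back the structure that quantization alone would lose, which is the intuition behind combining the two forms of compression.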
The algorithm's effectiveness was demonstrated on Llama 2 and Llama 3, open-source LLMs released by Meta AI. The researchers evaluated their compression method on benchmark tasks such as determining the logical order of statements and answering questions that involve physical reasoning. The results were promising: CALDERA showed up to a 5% improvement in performance over previous methods that used only low-precision compression. This is significant in the context of LLMs, where even slight gains in accuracy can have a large impact on practical use cases.
One of the major benefits of CALDERA's compression technique is its potential to let LLMs run on edge devices, such as smartphones, laptops, and other consumer hardware, without relying on centralized cloud servers. This has several advantages. It could drastically reduce the time and energy costs of accessing LLMs: rather than sending requests to remote servers, users could run inference locally on their devices, speeding up responses and shrinking the energy footprint. Compression also improves privacy, since the model's computations stay on the device and sensitive data never needs to be sent to external servers.
“By compressing the model and running it on edge devices, we can enable privacy-preserving AI,” said Andrea Goldsmith, a co-author of the study and dean of Princeton’s School of Engineering and Applied Science. “Rather than transmitting data to external servers, we can process it locally, keeping sensitive information within the user’s device.” This is a crucial step forward, especially as concerns around data privacy and security become more pronounced in the AI era.
However, there are trade-offs. Running compressed LLMs on consumer devices, while more efficient, could still result in heavy memory usage, which may drain battery life or slow down the device if not carefully managed. Rajarshi Saha, a Stanford Engineering Ph.D. student and co-author of the study, noted that although low-precision computation can help reduce power consumption, it is not a one-size-fits-all solution. “You won’t be happy if you are running an LLM and your phone drains out of charge in an hour,” Saha said, suggesting that additional optimization techniques will be needed to balance performance with power efficiency.
The researchers' work builds on prior advances in AI model compression, extending those techniques to LLMs. In 2023, Saha and his colleagues proposed a method for compressing the massive datasets used to train AI models. As the models themselves grew larger and more complex, they expanded their work to compress the models as well. This evolution in their research led to the development of CALDERA, which compresses LLMs more efficiently than existing methods.
The team’s success in using CALDERA to compress LLMs could have far-reaching implications for the future of AI. By making LLMs more efficient, accessible, and adaptable, the research could help bring powerful AI tools to a wider audience. Smaller, more efficient models could be deployed in environments where resources are limited, such as mobile devices or even in regions with low internet connectivity, where sending large amounts of data to remote servers is impractical.
Looking ahead, the researchers plan to refine their technique and continue exploring its potential applications. One area of focus will be the fine-tuning of compressed LLMs to ensure they perform optimally on different devices. “We are excited to see how our compression technique can be further adapted and fine-tuned for various use cases, ensuring that LLMs can be used effectively in real-world applications,” said Mert Pilanci, a co-author of the study and assistant professor at Stanford Engineering.
Source: Princeton University