Two lines of code to solve the dialogue limitations of large language models! Jia Jiaya's team at the Chinese University of Hong Kong and MIT release an ultra-long text extension technique
Lost in the middle, lazy models, the longer the context the dumber the model gets… Anyone who has used large language model products has run into the limits on input length: when you want to discuss something even slightly long with a large model, you have to split the input into pieces, and the model soon forgets the key points from the earlier pieces.
This is a typical dialogue defect of large language models! Like a child born with an attention deficit who struggles to focus through a whole book, the model at its core lacks long-text processing capability. That barrier has now been broken.
Recently, the new technique and models released by Jia Jiaya's team and MIT quietly climbed the trending lists of major open-source sites: first on the Hugging Face trending list, first on Papers with Code, fifth among all Python projects on GitHub Trending, more than 1,000 GitHub stars within a week, and nearly 180,000 views on the related technical posts on Twitter…
GitHub stars have reached 1.3k
Related tech posts on Twitter received nearly 180,000 views
The technique, called LongLoRA, is practical yet surprisingly simple: with just two lines of code and a single 8×A100 machine, the text length of a 7B model can be extended to 100k tokens, and that of a 70B model to 32k tokens. Alongside it, the research team released LongAlpaca, the first 70B-parameter long-text dialogue large language model.
World’s first 70B long text large language model released
LongLoRA is the first to tackle this dialogue defect of large language models head-on; from now on, papers dozens of pages long, reports hundreds of pages long, and even entire books are no longer blind spots for large models.
Some practitioners have enthusiastically called LongLoRA a beacon of hope in the maze of large language models: it represents the industry's renewed attention to long-text large language models, effectively expands their context window, lets models consider and process long text sequences, and stands as a genuine innovation for large language models.
Beyond the algorithmic innovation, another difficulty large language models face with long text is the lack of publicly available long-text dialogue data.
To this end, the research team collected 9k long-text question-answer pairs, covering Q&A about famous books, academic papers, in-depth reports, and even financial statements.
Answering long questions alone was not enough, so the team mixed a 3k short Q&A corpus in with the 9k long Q&A corpus for training, giving the long-text large model short-text dialogue ability at the same time. The complete dataset, called LongAlpaca-12k, is now open source.
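As a hedged illustration (the dataset ID below is an assumption based on the project's naming; check the LongLoRA repository for the official link), loading the corpus with the Hugging Face datasets library might look like this:

```python
# Sketch: load the open-sourced LongAlpaca-12k corpus for inspection.
# The dataset ID is assumed; consult the LongLoRA repo for the official source.
from datasets import load_dataset

ds = load_dataset("Yukang/LongAlpaca-12k", split="train")
print(len(ds))        # roughly 12k mixed long- and short-text QA pairs
print(ds[0].keys())   # inspect the fields of a single QA record
```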
Based on the LongAlpaca-12k dataset, the research team trained and evaluated models at several parameter scales, 7B, 13B, and 70B, and open-sourced them as LongAlpaca-7B, LongAlpaca-13B, and LongAlpaca-70B.
Reading novels, revising papers, commenting on the economy: an all-round champion
Without further ado, let's pick a few demos at random and see how LongAlpaca, the large model built by applying LongLoRA on top of the 12k Q&A corpus, performs.
First, have the system read a new paper and, following ICLR's review guidelines, propose revisions that would raise its chance of acceptance. LongAlpaca's advice: by stating the novelty more precisely, providing more rigorous and comparative experimental results (including specific datasets and metrics), covering broader applications and future directions, and focusing on the key contributions and impact, the paper's odds of acceptance will improve.
Now let the system read two different new papers and ask LongAlpaca to summarize the stylistic differences between ICLR and CVPR papers. LongAlpaca concludes that CVPR papers tend to be more structured and experimental, emphasizing practicality and technical depth, whereas ICLR papers are more flexible in style, emphasizing key theoretical analysis and mathematical derivation over standard formats.
Evidently, the trained LongAlpaca model takes in new long-form academic papers with ease and answers academic questions about them quite accurately.
Next, let's see how LongAlpaca handles the economics domain, where the bar for reading comprehension is high.
Given the IMF's collected global economic outlook overviews from 2012 to 2023, the model summarizes the economic situation year by year and forecasts future trends. LongAlpaca tells us that the global economic situation in 2023 is uncertain, with growth expected to slow relative to 2022; in the coming years, global expansion will be moderate and below expectations, while major challenges such as geopolitical tensions and financial market volatility remain and must be handled carefully through multilateral cooperation and structural reform.
LongAlpaca can also read new novels and analyze their content. The comparison below pits LongAlpaca against Llama2, both with 13B parameters, to see how each performs.
In the classic Journey to the West, why is Sun Wukong so powerful when he wreaks havoc in the Heavenly Palace, yet repeatedly thwarted on the journey to retrieve the scriptures? LongAlpaca gave five reasons, roughly summarized as: "the Monkey King's immaturity, strong opponents, the limits of his power, the enemy's deception, and his initial lack of companions led to his failures on the journey. But over time he gained wisdom, experience, and comrades, which enabled him to defeat his most formidable enemies."
Llama2's answer is simpler: "While he does make a fuss in the Heavenly Palace, it is not necessarily accurate to say that he represents an undefeated force. In fact, he was eventually defeated by the powerful Tang Sanzang in Journey to the West." Llama2 did not answer correctly, and it also asked for more of the text.
This time, have the system read The Three-Body Problem and ask: why did Ye Wenjie contact the aliens in the first book, and why did she come to regret it later? LongAlpaca holds that Ye Wenjie's initial contact with the aliens was driven by idealism and revenge, but over time her view of humanity changed; realizing the danger she might bring to all of humanity, her remorse grew by the day. A very much on-point answer.
Llama2's answer was: "Through interacting with the aliens, she hopes to gain a deeper understanding of the nature of the universe and her place in it. Ye Wenjie gradually realized that the aliens and their technology could not solve her problem." The answer was generic, and the model then drifted into commenting on the novel as a whole rather than answering the question.
Judging from these answers, some models such as Llama2 [2] may well have seen the novel during pre-training, yet their answers are unsatisfactory even for short questions posed directly about the novel.
Set side by side, the two models' answers leave no doubt who wins. Revising academic papers, commenting on global economic trends, reading novels: LongAlpaca is a skilled hand at all of them, beating Llama2 across the board.
Two lines of code and three key conclusions
Llama2 is arguably among the most powerful open-source models in the AI community, an industry leader, yet LongAlpaca managed to beat it. The LongLoRA technology behind it has duly caught the community's attention. How does it work?
It turns out that when large language models process long text, the main computational cost is concentrated in the self-attention mechanism, whose overhead grows quadratically with text length: doubling the sequence length quadruples the attention computation, so stretching a 4k context to 32k makes attention roughly 64 times more expensive.
To address this, the research team proposed LongLoRA, which approximates global self-attention through grouping and shifting.
Simply put, the tokens of a long text are split into groups, self-attention is computed within each group, and the grouping is shifted by half a group for half of the attention heads. This greatly reduces computation while still letting information propagate across the full receptive field.
And the implementation is remarkably concise: the core change takes only two lines of code!
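As a rough PyTorch sketch of the shifted grouped attention described above (this follows the spirit of the paper's published pseudocode, but the function name, the use of scaled_dot_product_attention, and the omission of causal masking are our own simplifications, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def shifted_group_attention(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    """S2-Attn sketch. qkv: (B, N, 3, H, D) projected queries/keys/values,
    with sequence length N divisible by group_size and head count H even."""
    B, N, _, H, D = qkv.shape
    G = group_size
    # Key line 1: shift half of the heads by G/2 tokens, then fold the
    # sequence into groups of G tokens.
    qkv = torch.cat((qkv.chunk(2, dim=3)[0],
                     qkv.chunk(2, dim=3)[1].roll(-G // 2, dims=1)), dim=3)
    qkv = qkv.view(B * N // G, G, 3, H, D)
    # Standard self-attention, computed independently within each group.
    q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B*N/G, H, G, D)
    out = F.scaled_dot_product_attention(q, k, v)   # (B*N/G, H, G, D)
    out = out.transpose(1, 2).reshape(B, N, H, D)
    # Key line 2: roll the shifted half of the heads back into place.
    out = torch.cat((out.chunk(2, dim=2)[0],
                     out.chunk(2, dim=2)[1].roll(G // 2, dims=1)), dim=2)
    return out
```

Because half of the heads attend within shifted groups, tokens near a group boundary still exchange information with their neighbors in the adjacent group, which is how the cheap per-group attention approximates a global receptive field.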
LongLoRA also explores low-rank training regimes. The original low-rank method, LoRA [5], does not on its own transfer well to longer text lengths. On top of low-rank training, LongLoRA additionally opens the embedding and normalization layers to fine-tuning, and thereby approaches the effect of full fine-tuning.
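A hedged sketch of what this looks like in practice with the Hugging Face peft library (the base model ID, target module names, and name-matching rule are illustrative assumptions; the LongLoRA repo's own training script is the authoritative reference):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Attach standard LoRA adapters to the attention projections.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, config)

# The LongLoRA twist: besides the low-rank adapters, also unfreeze the
# embedding and normalization layers (name matching here is an assumption).
for name, param in model.named_parameters():
    if "embed" in name or "norm" in name:
        param.requires_grad = True
```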
For text extension and training at different lengths, the effects of LongLoRA, LoRA, and full-parameter fine-tuning can be compared along three dimensions:
Perplexity: the original LoRA method degrades as length grows, while LongLoRA and full-parameter fine-tuning both maintain good results across text lengths.
Memory consumption: LongLoRA and the original LoRA both save substantially compared with full-parameter fine-tuning. For training at 8k length, for example, LongLoRA cuts memory consumption from 46.3GB (full-parameter fine-tuning) to 25.6GB.
Training time: for 64k-length training, LongLoRA cuts the time from roughly 90-100 hours with conventional LoRA down to 52.4 hours, while full-parameter fine-tuning takes over 1,000 hours.
The minimalist training recipe, modest demands on compute and time, and excellent accuracy make LongLoRA practical at scale. The technique and the models are all open source, and interested users can deploy and try them for themselves.
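For instance, a minimal inference sketch with Hugging Face transformers (the checkpoint ID is an assumption based on the project's naming; check the LongLoRA repository for the actual released models):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yukang/LongAlpaca-13B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto")

prompt = "Summarize the key arguments of the following report:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```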
It is worth mentioning that this is another major release from Jia Jiaya's team, following the multimodal large model LISA, which can "segment anything," released on August 9. Coming just two months later, it must be said, the pace and caliber of this research are as impressive as LongLoRA itself.