Tsinghua-affiliated ChatGLM3 demonstrated live! Multimodality approaches GPT-4V, and a domestic Code Interpreter is here

Original source: New Zhiyuan

Image source: Generated by Unbounded AI

The self-developed third-generation base model ChatGLM3 launched today!

This is another optimization of the ChatGLM base model by the Zhipu AI team since the launch of the second-generation model in June.

In addition, at the 2023 China Computer Conference (CNCC) on October 27, Zhipu AI also open-sourced ChatGLM3-6B (32k), multimodal CogVLM-17B, and agent AgentLM.

With the release of the ChatGLM3 series, Zhipu has become the only company in China whose products benchmark against OpenAI's full model product line.

The generative AI assistant Zhipu Qingyan has also become the first large-scale model product with code interaction capabilities in China.

The models are fully self-developed, adapted to domestic chips, with stronger performance and an even more open, open-source ecosystem.

As one of the first companies to enter large-model research, Zhipu AI is also among the first to deliver.

Moreover, Zhipu AI has raised more than 2.5 billion yuan in total financing this year; Meituan, Ant, Alibaba, Tencent... the star-studded list of investors shows the industry's strong confidence in Zhipu AI.

A technical upgrade aimed squarely at GPT-4V

At present, the multimodal vision model GPT-4V has shown strong image recognition capabilities.

At the same time, aiming at GPT-4V, Zhipu AI has also iteratively upgraded the other capabilities of ChatGLM3. Among them, the multimodal understanding model CogVLM sets new SOTA results on 10+ international standard image-text evaluation datasets. CogVLM-17B is currently open-sourced.

Code Interpreter can generate and execute code according to user needs, automatically completing complex tasks such as data analysis and file processing.

The web-search-enhanced WebGLM can automatically find relevant information on the Internet based on the question and, when answering, provide reference links to related literature or articles.

In addition, the semantic and logical capabilities of ChatGLM3 have also been greatly enhanced.

The 6B version, open-sourced directly

It is worth mentioning that upon ChatGLM3's release, Zhipu AI directly open-sourced the 6B-parameter model to the community.

The evaluation results show that, compared with ChatGLM 2 and with domestic models of the same size, ChatGLM3-6B ranked first on 9 of 44 Chinese and English public dataset tests.

MMLU improved by 36%, C-Eval by 33%, GSM8K by 179%, and BBH by 126%.

Its open-source 32k version, ChatGLM3-6B-32K, performs best on LongBench.

In addition, ChatGLM3 adopts the latest "efficient dynamic inference + GPU memory optimization" technology, making its inference framework more efficient than current mainstream implementations under the same hardware and model conditions.

Compared with the current best open-source implementations, vLLM from UC Berkeley and the latest version of Hugging Face TGI, inference speed is increased 2-3x and inference cost is halved, to just 0.5 fen per thousand tokens, the lowest available.

Self-developed AgentTuning activates agent capabilities

What's even more surprising is that ChatGLM3 also brings a new agent ability.

Zhipu AI hopes that large models can better communicate with external tools through APIs, and even realize large model interaction through agents.

By integrating the self-developed AgentTuning technology, the intelligent agent capability of the model can be activated, especially in terms of intelligent planning and execution, which is 1000% higher than that of ChatGLM 2.

On the latest AgentBench, ChatGLM3-turbo is close to GPT-3.5.

At the same time, AgentLM is also open-sourced to the community. The Zhipu AI team hopes to make open-source models reach, or even exceed, the agent capability of closed-source models.

This means that agent capability will give domestic large models native support for complex scenarios such as "tool calling, code execution, games, database operations, knowledge graph search and inference, and operating systems".
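
To make the idea of tool calling concrete, here is a rough, framework-agnostic sketch of the loop an agent runs: the model emits a structured call, the runtime executes the matching tool, and the observation is handed back for the next turn. The tool registry and JSON format below are purely hypothetical and do not reproduce ChatGLM3's actual tool-call schema.

```python
import json

# Hypothetical tool registry and call format, for illustration only; ChatGLM3's
# real tool-calling schema is defined by its own API and is not reproduced here.
TOOLS = {
    "query_database": lambda sql: [{"id": 1, "name": "example row"}],  # stub
    "run_code": lambda expr: str(eval(expr, {}, {})),                  # stub; unsafe outside demos
}

def agent_step(model_reply: str) -> str:
    """If the model asked for a tool, execute it and return the observation."""
    call = json.loads(model_reply)               # e.g. {"tool": "run_code", "argument": "1 + 1"}
    result = TOOLS[call["tool"]](call["argument"])
    return json.dumps({"observation": result})   # fed back to the model for its next turn

print(agent_step('{"tool": "run_code", "argument": "1 + 1"}'))
```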

1.5B/3B versions released at the same time, runnable on mobile phones

Want to run ChatGLM on your phone? OK!

This time, ChatGLM3 also launched on-device test models that can be deployed on mobile phones, in two sizes: 1.5B and 3B.

They support a variety of mobile phones, including Vivo, Xiaomi, and Samsung, as well as in-vehicle platforms, and even support inference on mobile-platform CPU chips, with speeds of up to 20 tokens/s.

In terms of accuracy, the 1.5B and 3B models perform close to the ChatGLM2-6B model on public benchmark evaluations, so go ahead and try them!

A new generation of "Zhipu Qingyan" is fully launched

Just as ChatGPT has the powerful GPT-4 model behind it, the Zhipu AI team's generative AI assistant "Zhipu Qingyan" is now powered by ChatGLM3.

Right after the team's live demonstration, the feature went straight online; the keyword here is sincerity!

Test address:

### Code Interpreter

As one of ChatGPT's most popular plugins, Advanced Data Analysis (formerly Code Interpreter) can analyze problems with mathematical thinking based on natural-language input, and generate the appropriate code at the same time.

Now, with the support of the newly upgraded ChatGLM3, "Zhipu Qingyan" has become the first large-model product in China with Advanced Data Analysis capabilities, supporting use scenarios such as image processing, mathematical computation, and data analysis.

The romance of science and engineering guys, perhaps only "Zhipu Qingyan" can understand.

Although CEO Zhang Peng's live attempt to draw a "red heart" flopped on stage, a second try produced the result in seconds.

Similarly, the upgraded ChatGLM3 is also very good at data analysis.

After some analysis, it drew a histogram of the length distribution based on the lengths of the field values.
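
The generated code for such a request typically looks roughly like the pandas/matplotlib sketch below; the file name and column name are hypothetical placeholders, not the data used in the demo.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input file and column name, for illustration only.
df = pd.read_csv("data.csv")
field_lengths = df["field"].astype(str).str.len()   # length of each value in the field

# Histogram of the length distribution.
plt.hist(field_lengths, bins=30)
plt.xlabel("Field length (characters)")
plt.ylabel("Count")
plt.title("Distribution of field lengths")
plt.show()
```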

### Search Enhancements

With the addition of the WebGLM model's capabilities, "Zhipu Qingyan" now also has search augmentation: it can summarize answers to questions based on the latest information on the Internet and attach reference links.

For example, the iPhone 15 has recently seen a wave of price cuts; how big is the drop, exactly?

The answer given by "Zhipu Qingyan" is not bad!

### Image and Text Understanding

The CogVLM model improves Zhipu Qingyan's Chinese image-text comprehension, achieving picture-understanding ability close to GPT-4V's.

It can answer various types of visual questions, and can perform complex object detection and labeling, completing automatic data annotation.

As an example, let CogVLM identify how many people are in the picture.

Raising the difficulty a little with a picture of three oranges stacked together, it can still accurately identify the quantity.

Neymar, Messi, Ronaldo: CogVLM identifies them without hesitation, too.

For a visual math problem adding 2 apples and 1 apple, CogVLM also gets it right.

**GLM vs GPT: Benchmarking OpenAI's full line of products!**

From ChatGPT, the chat application, and Code Interpreter, the code-generation plugin, to DALL·E 3, and then to the visual multimodal model GPT-4V, OpenAI currently has a complete product architecture.

Looking back at China, the only company with such comprehensive product coverage is Zhipu AI.

### Conversation: ChatGPT vs. ChatGLM

ChatGPT, the star of the moment, needs no further introduction.

At the beginning of this year, the Zhipu AI team also released ChatGLM, a 100-billion-level dialogue model.

Drawing on ChatGPT's design ideas, the developers injected code pre-training into the 100-billion-parameter base model GLM-130B.

In fact, as early as 2022, Zhipu AI opened GLM-130B to the research community and industry, and this research was also accepted by ACL 2022 and ICLR 2023.

Both the ChatGLM-6B and ChatGLM-130B models were trained on Chinese and English corpora containing 1T tokens, using supervised fine-tuning (SFT), feedback bootstrapping, and reinforcement learning from human feedback (RLHF).

The ChatGLM model is capable of generating answers that conform to human preferences. Combined with quantization technology, users can deploy it locally on consumer-grade graphics cards (only 6GB of GPU memory is required at the INT4 quantization level) and run their own ChatGLM on a laptop.
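
As a minimal sketch of such a local deployment, the snippet below follows the usage documented on the THUDM/chatglm-6b Hugging Face model card; the repo name, the quantize() helper, and the chat() interface are taken from that card and may have changed since.

```python
from transformers import AutoTokenizer, AutoModel

# Load ChatGLM-6B and quantize to INT4 (~6 GB of GPU memory), per the model card.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = model.half().quantize(4).cuda()
model = model.eval()

# chat() returns the reply together with the updated conversation history.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```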

On March 14, Zhipu AI open-sourced ChatGLM-6B to the community, and it won first place in third-party evaluations of Chinese natural language, Chinese dialogue, Chinese Q&A, and reasoning tasks.

At the same time, hundreds of projects or applications based on ChatGLM-6B were born.

To further promote the development of the large-model open-source community, Zhipu AI released ChatGLM2 in June, upgrading and open-sourcing the 100-billion-parameter base dialogue model in 6B, 12B, 32B, 66B, and 130B sizes, with improved capabilities and richer scenarios.

ChatGLM 2 ranked first on the Chinese lists: as of June 25, 2023, ChatGLM2 held Rank 0 on the C-Eval list and ChatGLM2-6B held Rank 6. Compared with the first-generation model, ChatGLM 2 achieved improvements of 16%, 36%, and 280% on MMLU, C-Eval, and GSM8K, respectively.

It is worth mentioning that in just a few months, ChatGLM-6B and ChatGLM2-6B have been widely used.

At present, they have collected 50,000+ stars on GitHub and 10,000,000+ downloads on Hugging Face, ranking first in the four-week trend.

ChatGLM-6B:

ChatGLM2-6B:

### Search Enhancements: WebGPT vs. WebGLM

To solve the "hallucination" problem of large models, the general solution is to combine knowledge from search engines and let the large model perform "retrieval augmentation".

As early as 2021, OpenAI fine-tuned WebGPT, a GPT-3-based model that can aggregate search results.

WebGPT models human search behavior: it searches web pages to find relevant answers and gives citation sources, so that the output can be traced.

Most importantly, it achieved excellent results in long-form open-domain Q&A.

Under the guidance of this idea, WebGLM, the "networked version" of ChatGLM, was born: a model fine-tuned from ChatGLM's 10-billion-parameter base, focused on web search.

Address:

For example, when you want to know why the sky is blue, WebGLM immediately gives the answer using the web and includes links to enhance the credibility of its response.

Architecturally, the WebGLM search enhancement system involves three important components: a retriever, a generator, and a scorer.

The LLM-based retriever works in two stages: coarse-grained web retrieval (search, fetching, extraction) and fine-grained LLM-distilled retrieval.

In the retriever's overall pipeline, time is mainly spent fetching web pages, so WebGLM uses parallel asynchronous techniques to improve efficiency.

The bootstrapped generator is the core; it is responsible for generating high-quality answers to questions from the reference pages obtained by the retriever.

It uses the contextual inference capabilities of large models to generate high-quality QA datasets, and designs correction and selection strategies to filter out high-quality subsets for training.

The final scorer is used to score WebGLM-generated answers through RLHF in order to align them with human preferences.
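
Put together, the three components form a simple pipeline. The sketch below is purely schematic: every helper is a placeholder stub standing in for WebGLM's LLM-based components, and only the overall retrieve, generate, and score flow (with parallel page fetching) mirrors the description above.

```python
from concurrent.futures import ThreadPoolExecutor

def web_search(question):             # coarse stage: search-engine query
    return ["https://example.com/a", "https://example.com/b"]

def fetch_and_extract(url):           # coarse stage: fetch a page, strip it to plain text
    return f"text extracted from {url}"

def rank_passages(question, pages):   # fine stage: LLM-distilled relevance ranking
    return sorted(pages, key=len)

def generate_answer(question, refs):  # bootstrapped generator, conditioned on the references
    return f"Answer to '{question}' citing {len(refs)} references."

def score(question, answer, refs):    # scorer aligned with human preferences via RLHF
    return len(answer)

def answer_question(question):
    urls = web_search(question)
    with ThreadPoolExecutor() as pool:                    # pages are fetched in parallel,
        pages = list(pool.map(fetch_and_extract, urls))   # the main latency saving
    references = rank_passages(question, pages)[:5]
    candidates = [generate_answer(question, references) for _ in range(4)]
    return max(candidates, key=lambda a: score(question, a, references))

print(answer_question("Why is the sky blue?"))
```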

Experimental results show that WebGLM can provide more accurate results and complete Q&A tasks efficiently; with only 10 billion parameters, it even approaches the performance of the 175-billion-parameter WebGPT.

At present, this research has been accepted by KDD 2023, and the Zhipu AI team has also open-sourced the capabilities and datasets.

Project Address:

### Image and Text Understanding: GPT-4V vs. CogVLM

In September of this year, OpenAI officially lifted the ban on GPT-4's amazing multimodal capabilities.

GPT-4V, built on this capability, has a strong ability to understand images and can process arbitrarily mixed multimodal inputs.

For example, it can tell that the dish in the picture is mapo tofu, and it can even give the ingredients for making it.

In October, Zhipu open-sourced a new visual language basic model, CogVLM, which can realize the deep integration of visual language features without sacrificing the performance of any NLP tasks.

Different from common shallow fusion methods, CogVLM incorporates a trainable vision expert module into the attention mechanism and feedforward neural network layer.

This design achieves deep alignment between image and text features, effectively bridging the gap between the pre-trained language model and the image encoder.
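
The PyTorch fragment below is a minimal sketch of that idea, not CogVLM's actual implementation: image-token positions are routed through a separate trainable expert FFN while text tokens keep the original language-model FFN, and the same routing applies to the attention QKV projections.

```python
import torch
import torch.nn as nn

class VisionExpertFFN(nn.Module):
    """Route image tokens through a trainable 'vision expert' FFN, text tokens through the LM FFN."""
    def __init__(self, hidden: int):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                      nn.Linear(4 * hidden, hidden))
        self.vision_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                        nn.Linear(4 * hidden, hidden))

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, hidden]; is_image: [batch, seq] boolean mask of image positions.
        return torch.where(is_image.unsqueeze(-1), self.vision_ffn(x), self.text_ffn(x))

x = torch.randn(2, 8, 64)
is_image = torch.zeros(2, 8, dtype=torch.bool)
is_image[:, :4] = True                       # first 4 positions hold image tokens
out = VisionExpertFFN(64)(x, is_image)       # [2, 8, 64]
```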

At present, CogVLM-17B has the highest overall score on authoritative multimodal academic leaderboards, achieving SOTA or second place on 14 datasets.

It achieves state-of-the-art (SOTA) performance across 10 authoritative cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz-VQA, and TDIUC.

The core idea of CogVLM is "visual first".

Previous multimodal models usually align image features directly to the input space of text features, and the image-feature encoder is usually small; in this case, the image is effectively a "vassal" of the text, and the effect is naturally limited.

CogVLM, by contrast, prioritizes visual understanding: it uses a 5B-parameter vision encoder and a 6B-parameter vision expert module, for a total of 11B parameters to model image features, even more than the 7B parameters devoted to text.

In some tests, CogVLM even outperformed GPT-4V.

There are 4 houses in the picture, 3 are fully visible, and 1 can only be seen if you zoom in.

CogVLM can accurately identify these 4 houses, while GPT-4V can only identify 3.

The next question tests pictures that contain text.

CogVLM faithfully describes the scene and the corresponding text.

### Text-to-Image: DALL·E vs. CogView

OpenAI's most powerful text-to-image model is DALL·E 3.

In contrast, the Zhipu AI team has launched CogView, a Transformer-based text-to-image universal pre-trained model.

Address:

The overall idea of CogView is to perform autoregressive training by concatenating text features and image token features. At generation time, only the text tokens are input, and the model continuously generates image tokens.

Specifically, the text "The avatar of a cute kitten" is first converted into tokens, using the SentencePiece model.

Then an image of a cat is fed in, and the image part is converted into tokens through a discrete autoencoder.

The text and image token features are then concatenated and fed into a GPT-style Transformer model, which learns to generate images.

Finally, after training, during text-to-image generation the model ranks the generated results by computing a Caption Score and selects the best-matching ones.
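
The sketch below illustrates that training and generation recipe in schematic PyTorch; `transformer` stands for any callable mapping a token sequence to next-token logits, and the token shapes and vocabulary are assumptions rather than CogView's real configuration.

```python
import torch
import torch.nn.functional as F

def training_step(transformer, text_tokens, image_tokens):
    # Concatenate text and image tokens into one sequence and train the
    # autoregressive transformer to predict every next token.
    sequence = torch.cat([text_tokens, image_tokens], dim=1)
    logits = transformer(sequence[:, :-1])              # [batch, seq-1, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           sequence[:, 1:].reshape(-1))

@torch.no_grad()
def generate_image_tokens(transformer, text_tokens, num_image_tokens=1024):
    # At generation time only the text tokens are given; image tokens are sampled
    # one by one and later decoded back to pixels by the discrete autoencoder.
    sequence = text_tokens
    for _ in range(num_image_tokens):
        next_logits = transformer(sequence)[:, -1]
        next_token = torch.multinomial(next_logits.softmax(-1), num_samples=1)
        sequence = torch.cat([sequence, next_token], dim=1)
    return sequence[:, text_tokens.size(1):]
```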

Compared with DALL·E and common GAN-based approaches, CogView's results are greatly improved.

In 2022, the researchers upgraded the text-to-image model again with CogView2, whose results are directly comparable to DALL·E 2.

Address:

Compared with CogView, the CogView2 architecture adopts a hierarchical transformer and a parallel autoregressive mode for image generation.

In the paper, the researchers pre-trained a 6-billion-parameter Transformer, the Cross-Modal General Language Model (CogLM), and fine-tuned it to achieve fast super-resolution.

The experimental results show that, compared with DALL·E 2, CogView2 also has advantages in generation quality, and it additionally supports interactive, text-guided editing of images.

In November of the same year, the team built a text-to-video generation model, CogVideo, based on the CogView2 model.

The model architecture is divided into two modules: the first, based on CogView2, generates several keyframe images from the text; the second interpolates between frames using a bidirectional attention model to produce a complete video at a higher frame rate.
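
A schematic sketch of that two-stage pipeline is given below; generate_keyframe and interpolate are placeholder stubs standing in for the CogView2-based generator and the bidirectional-attention interpolation model.

```python
def generate_keyframe(text, index):
    return f"keyframe[{index}] for '{text}'"            # stage 1: CogView2-style text-to-image

def interpolate(frame_a, frame_b, count):
    return [f"interp({frame_a} -> {frame_b}, {i})" for i in range(count)]  # stage 2

def generate_video(text, num_keyframes=5, factor=4):
    keyframes = [generate_keyframe(text, i) for i in range(num_keyframes)]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(interpolate(a, b, factor - 1))     # fill in frames to raise the frame rate
    video.append(keyframes[-1])
    return video

print(len(generate_video("a cat playing the piano")))   # 5 keyframes + 4*3 interpolated = 17
```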

At present, all of the above models are open source. Could the Tsinghua-affiliated teams be any more direct and sincere?

### Code: Codex vs. CodeGeeX

In the field of code generation, OpenAI released the new and upgraded Codex as early as August 2021; it is proficient in more than 10 programming languages including Python, JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and even Shell.

Address:

Users simply give a natural-language prompt, and Codex writes the code automatically.

Codex is trained on top of GPT-3, with training data containing billions of lines of source code. In addition, Codex can support context more than 3 times longer than GPT-3.

As a domestic pioneer, Zhipu open-sourced CodeGeeX, a 13-billion-parameter pre-trained model for code generation, translation, and interpretation across multiple programming languages, in September 2022; it was later accepted at KDD 2023 (Long Beach).

Address:

In July 2023, Zhipu released the stronger, faster, and lighter CodeGeeX2-6B, which supports more than 100 languages, with weights fully open for academic research.

Project Address:

CodeGeeX2 is based on the new ChatGLM2 architecture and is optimized for a variety of programming-related tasks, such as code auto-completion, code generation, code translation, cross-file code completion, and more.

Thanks to the ChatGLM2 upgrade, CodeGeeX2 not only better supports Chinese and English input and a maximum sequence length of 8192, but also greatly improves performance across languages: Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, Rust +321%.

On the HumanEval benchmark, CodeGeeX2 comprehensively surpasses the 15-billion-parameter StarCoder model, as well as OpenAI's Code-Cushman-001 model (the model behind GitHub Copilot).

In addition, CodeGeeX2's inference is faster than the first-generation CodeGeeX-13B; after quantization it needs only 6GB of GPU memory to run and supports lightweight local deployment.
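
A minimal sketch of such a local deployment, following the usage documented on the THUDM/codegeex2-6b Hugging Face model card (the repo name, prompt format, and generation arguments are taken from that card and may have changed):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/codegeex2-6b", trust_remote_code=True, device="cuda")
model = model.eval()

# CodeGeeX2 expects a language tag comment followed by the task description.
prompt = "# language: Python\n# write a bubble sort function\n"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_length=256, top_k=1)
print(tokenizer.decode(outputs[0]))
```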

At present, the CodeGeeX plug-in can be downloaded and experienced in mainstream IDEs such as VS Code, IntelliJ IDEA, PyCharm, GoLand, WebStorm, and Android Studio.

A domestic large model, fully self-developed

At the conference, Zhang Peng, CEO of Zhipu AI, opened with his own view: the first year of large models was not the year ChatGPT triggered the LLM boom, but 2020, when GPT-3 was born.

At that time, Zhipu AI, then just one year old, decided to go all in on large models with the strength of the whole company.

As one of the first companies to enter large-model research, Zhipu AI has accumulated ample enterprise-service capability; as one of the first to "eat the crab" of open source, its ChatGLM-6B topped the Hugging Face trend list within four weeks of launch and won 50,000+ stars on GitHub.

The release of ChatGLM3 makes the full-model product line that Zhipu AI has built more powerful.

In 2023, with the large-model war raging across the industry, Zhipu AI once again stands in the spotlight, seizing the first-mover advantage with the newly upgraded ChatGLM3.

Resources:
