After ChatGPT opened to the public as a free preview in late November 2022, it quickly became popular for its ability to generate many kinds of content, including prose, jokes, poetry, and even code, from user prompts. The experience was not only intuitive but also felt like conversing with a real person. According to research by Credit Suisse, ChatGPT had already reached 100 million active users by January 2023, making it the fastest-growing consumer application in history.
The generative AI wave sparked by ChatGPT immediately caught the attention of international giants such as Microsoft and Google, which began integrating related technologies into their own products. For example, when Microsoft unveiled its new Bing search engine with ChatGPT technology built in, the company’s stock price rose more than 4% overnight, adding over 80 billion USD in market value, a clear demonstration of the influence of generative AI.
Over the past decade, industries of all kinds have actively adopted AI and achieved excellent results. However, the success of ChatGPT in building generative AI on Large Language Models (LLMs) has exceeded expectations and pushed AI into its next era: AI 2.0.
The barriers to entry into the AI 2.0 era
To enter the era of AI 2.0 and harness LLM foundation models for one’s own industry or application domain, several hurdles must be overcome. First, businesses need to be familiar with distributed training of large-scale models and know how to train a single large model across many nodes at once. Pipeline Parallelism (PP), Tensor Parallelism (TP), and Data Parallelism (DP) are the three key parallelization strategies for training models across nodes. Because both the large model and the required dataset are massive, the memory of a single GPU cannot hold them; the model must be partitioned by width (TP) and depth (PP), and the dataset sharded (DP), so that many GPUs can process the model and data together. Tuning the combination of TP, DP, and PP is therefore one of the keys to training large models efficiently.

In addition, effective memory management is critical to training efficiency. In parallel computing, ZeRO (Zero Redundancy Optimizer) techniques can manage memory usage and eliminate redundant copies, while the 1F1B (One Forward, One Backward) pipeline schedule keeps memory actively in use and reduces idle time, further improving training efficiency. A minimal sketch of how these parallelism degrees fit together follows.
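As an illustration only (the node counts, batch sizes, and variable names below are hypothetical, not any real production configuration), the three parallelism degrees must multiply out to the total number of GPUs, and 1F1B bounds how many micro-batches are in flight at once:

```python
# Hypothetical sketch: sizing a 3D-parallel training job.
num_nodes = 16
gpus_per_node = 8
world_size = num_nodes * gpus_per_node          # 128 GPUs in total

tensor_parallel = 8      # TP: split each layer "widthwise" across GPUs within a node
pipeline_parallel = 4    # PP: split the model "depthwise" into 4 pipeline stages
# Whatever is left over becomes the data-parallel degree (DP):
data_parallel = world_size // (tensor_parallel * pipeline_parallel)   # 128 / 32 = 4

assert tensor_parallel * pipeline_parallel * data_parallel == world_size

# With 1F1B scheduling, each pipeline stage alternates one forward and one
# backward micro-batch, so only a bounded number of activations is held in
# memory at any time instead of the activations of a whole batch.
micro_batch_size = 2
num_micro_batches = 32
global_batch_size = micro_batch_size * num_micro_batches * data_parallel  # 256 sequences

print(f"TP={tensor_parallel} PP={pipeline_parallel} DP={data_parallel}, "
      f"global batch = {global_batch_size}")
```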
Secondly, correspondingly high computing power is needed to support LLMs. The FLOPs required by large models keep climbing: taking GPT-3 175B as an example, training demands roughly 3.64 × 10³ PetaFLOP/s-days of compute. Beyond raw compute, an efficient parallel storage system such as GPFS is also needed before LLM training can get under way.
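As a rough sanity check of that figure (using the common approximation that training compute ≈ 6 × parameters × training tokens, with the roughly 300 billion training tokens reported for GPT-3; this is an estimate, not an exact accounting):

```python
# Back-of-the-envelope check of the GPT-3 training-compute figure.
params = 175e9              # GPT-3 has 175 billion parameters
tokens = 300e9              # ~300 billion training tokens (approximate)
total_flops = 6 * params * tokens            # ≈ 3.15e23 FLOPs

petaflop_s_day = 1e15 * 86_400               # FLOPs in one PetaFLOP/s-day
print(total_flops / petaflop_s_day)          # ≈ 3.6e3 PetaFLOP/s-days
```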
The third threshold is understanding the related techniques of fine-tuning and prompt tuning. In-context learning adapts an LLM foundation model by expressing the downstream task directly in the model’s prompt, so no separate set of fine-tuned parameters has to be stored for each task; it improves the model’s grasp of the task and brings its generalization closer to human reasoning patterns. This changes how tasks are learned from large labeled datasets and moves toward zero-shot or few-shot learning. Prompt tuning, in turn, adapts an LLM to a specific domain or goal by devising domain-specific prompt strategies that guide the model to generate text in the desired style and toward the desired target, and it can speed up adoption by providing prompt templates suited to the intended use case. Although AI models can generate high-quality content, the output does not always meet user expectations; prompt tuning can improve the quality of generated content, save time and cost, increase the diversity of content, and improve interaction with users, all of which make AI-generated content more practical and effective. A small few-shot example follows.
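For illustration only (the task, reviews, and labels below are made up), a few-shot in-context learning prompt expresses the downstream task entirely in the input, leaving the model’s weights untouched:

```python
# Minimal illustration of few-shot in-context learning: the downstream task
# (sentiment classification) is described entirely in the prompt, so the
# model itself is never retrained. All examples here are invented.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# This prompt would then be sent to any instruction-following LLM endpoint;
# the model is expected to continue with "Positive".
print(few_shot_prompt)
```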
The fourth threshold is overcoming the challenges of large-model inference, since LLM deployment and serving require an optimized environment. Because an LLM is already too large for a single GPU to handle, a multi-GPU inference framework is needed to meet low-latency requirements. GPU kernel performance also has to be raised, for example through multi-dimensional fusion that combines vertical fusion, horizontal fusion, and memory fusion.
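A rough, purely illustrative calculation (fp16 weights, an 80 GB accelerator, ignoring activations and the KV cache) shows why a BLOOM-scale model cannot be served from a single GPU:

```python
import math

# Why a 176B-parameter model cannot fit on one GPU (illustrative figures only).
params = 176e9
bytes_per_param = 2                              # fp16 weights
weight_bytes = params * bytes_per_param          # ≈ 352 GB for the weights alone

gpu_memory_bytes = 80e9                          # e.g. an 80 GB accelerator
min_gpus = math.ceil(weight_bytes / gpu_memory_bytes)
print(min_gpus)  # at least 5 GPUs just for the weights, before activations/KV cache
```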
The final threshold is preparing a high-performance system environment in which computation, networking, and storage work together, so that the model-training environment can be configured optimally.
Open-source large language models can help popularize AI 2.0
All of this shows that the development threshold for LLMs is very high: even for international giants like Microsoft and Google, standing up an LLM is no simple task. Moreover, for commercial and other reasons, these giants mostly restrict their customers’ access to their complete models.
Fortunately, the BigScience research collective, made up of over a thousand researchers worldwide, trained a language model called BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) on the Jean Zay supercomputer in France over 117 days. It has 176 billion parameters, comparable in parameter count and architecture to GPT-3. Completed in July 2022, BLOOM was trained on roughly 1.5 TB of data covering 46 natural languages and 13 programming languages, including Spanish, Chinese, and several Indian and African languages. Its main tasks include text classification, dialogue generation, text generation, translation, question answering (semantic search), and summarization. Users can pick a language and ask BLOOM to write recipes, translate or summarize text, and even write code.
It is worth noting that BLOOM is the first open-source large language model of this scale: academia, non-profit organizations, and small and medium-sized enterprises now have access to a resource that previously only a few international giants could command. However, because of BLOOM’s enormous dataset and model size, users still face substantial development and maintenance challenges, and the shortage of training experience and talent makes putting an LLM into service even harder.
Lambda Labs’ Chief Scientist has estimated that training a GPT-3-class model would cost at least 4.6 million USD and take roughly 355 GPU-years on a single GPU, so even though BLOOM has been open-sourced, most businesses still need the help of consulting service providers to clear the AI 2.0 threshold.
AI 2.0 consultancy services help overcome development barriers
Because BLOOM, at 176 billion parameters, is far too large to be trained on any single GPU, parallelization techniques are required to partition the model precisely, optimize the TP + DP + PP combination, and distribute the training efficiently to accelerate it. A world-class AIHPC supercomputer, such as the one behind TWCC, is needed to make BLOOM-scale training possible and to run it quickly on Taiwan’s AI cloud platform.
Traditional cross-node parallel computing tends to lose efficiency as the number of nodes increases. For example, if one node delivers a computing power of 100, two nodes should in theory deliver 200 under linear scaling; in practice they may deliver only about 180, because inter-node communication eats into efficiency.
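In code, that example corresponds to a scaling efficiency of 90% (the numbers are the illustrative ones from the paragraph above):

```python
# Scaling efficiency from the illustrative numbers above.
single_node = 100
ideal_two_nodes = 2 * single_node        # 200 under perfect linear scaling
measured_two_nodes = 180                 # throughput actually observed

efficiency = measured_two_nodes / ideal_two_nodes
print(f"Scaling efficiency: {efficiency:.0%}")   # 90%
```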
Because TWCC’s cross-node parallel computing environment is built on an InfiniBand fabric, nodes can cooperate efficiently. When running BLOOM, TWCC achieved near-linear cross-node scaling, a close-to-ideal high-performance result. This lets users exploit the available compute fully, with training time continuing to fall as more nodes are added.
Using 105 nodes and 840 GPUs, TWCC partitioned and distributed the model precisely for massive parallel computing, with every GPU card running near its maximum efficiency. Successfully training BLOOM-scale models on TWCC not only helps optimize large-model inference systems but also overcomes the challenge of multi-node inference.
Building on these concrete results with BLOOM, Taiwan Computing Cloud (TWCC) has begun offering a one-stop integrated service, the “AI 2.0 High Computing Advisor Service”. It provides AI experts, AIHPC infrastructure resources, and large language model (LLM) development services, integrates and optimizes the related packages and environments, and helps customers launch LLM projects with minimal risk. The service shortens the path from requirements to usable models and applications and builds a dedicated large language model for each customer. Companies can avoid huge outlays of time, technology costs, development risk, hardware, and manpower, saving at least millions of dollars and putting every dollar in the right place.
● Learn more about the “AI 2.0 Foundation Model Consulting Services”: https://tws.twcc.ai/en/ai-llm/
● Sign up now for the AIHPC x LLM Large Language Model Exhibition on 3/17: https://tws.twcc.ai/2023/02/23/llm2/