Limitations and Risks of Large Language Models

Let’s consider the risks and biases present in these systems, how they can impact society, and how to assess whether an LLM is suitable for a given task.

Introduction

In 2017 a Palestinian worker posted a picture of himself leaning against a bulldozer together with the caption "good morning". Facebook's LM-driven translation service, however, mistranslated it as "attack them". This triggered a response by Israeli police and led to the man's arrest (Guardian). He was released a few hours later, and Facebook apologized. While no greater harm was caused in this case, it is not hard to conceive of a much more violent outcome given the notoriously tense situation in the region.

This was one of the first examples of deployed machine-learning-based NLP technology failing in a way capable of causing irreversible harm to humans, and LMs have evolved significantly since then.

We have seen the advent and rapid improvement of systems which are capable of a) performing "well" on a wide range of problems for which an "understanding" of natural language is deemed necessary and b) producing outputs that are increasingly indistinguishable from human output. While their emerging capabilities are undeniably impressive and promise a wide range of applications, the potential changes in how we share knowledge and communicate are fundamental. Identifying and addressing the arising risks is therefore a pressing task given the speed of current developments.

So far, most (though by no means all) of what is learned by LLMs is acquired from natural language utterances (social media, prose of all kinds and times, poetry, customer support interactions, websites, etc.) produced by humans, each situated in their respective societal role and intent of engagement. Moreover, it is to be expected that the world views expressed in those texts and encoded by the LM will not adequately reflect the heterogeneity of views held in society, but will instead be biased. Taking this inherent bias in the training data into account has not been the focus of most actors in the extremely dynamic research and development environment, which is heavily dominated by big technology companies (as opposed to public-interest academia). On the other hand, training corpora have become so large that effective "curation", as urged by Bender and Gebru, 2021, can be seen as an as yet intractable problem that requires further research into effective tooling. It is therefore to be expected that LLMs trained on contemporary large corpora like The Pile (880 GB, Gao et al., 2020) or C4 (2.3 TB, Raffel et al., 2020) exhibit views on race, gender, religion, sexual orientation, etc. which might not seem tolerable and can be considered toxic.

Bringing the ever-increasing quality of the linguistic form of LLM outputs into play: when processing linguistic form, humans have been shown to infer meaning, grounded in the common human perception of reality, as long as the form looks plausible enough. To what extent LMs are inherently able to capture a notion of meaning that corresponds to the human one is subject to ongoing research and debate. It is certain, however, that by "parroting" toxic views of the world, LLMs are suited to reinforcing those views on a societal level by creating a kind of toxic feedback loop (Weidinger et al., 2021).

When considering the development of an LM-driven application, it is important to stress that, in contrast to traditional rule-based software, it is inherently impossible to guarantee desired output (or an upper bound on undesired output). The solution to a problem learned by a neural network cannot be "debugged" and "fixed" in a way comparable to a traditional software development cycle. This grave shift in the paradigm of developing software-driven solutions (also with regard to "agility") should always be included in the assessment of whether LMs are in fact the right tool for the job: while they are indeed powerful, they are hard to control.


Risks for Individuals and Society

You might be familiar with Apple's Siri, Amazon's Alexa and Google's Assistant from your daily life. But do you know how many AI bots have also been shut down? In 2016 Microsoft released its AI chatbot Tay on Twitter, but after only 16 hours it started to utter racist slurs and Microsoft had to take it offline. If you don't know Tay, you can get an impression of these quick changes in the following images. Microsoft gave it a second try a few weeks later, only to face the same issue even more quickly. In 2017 Facebook launched, and shortly after shut down, an AI negotiation bot when they discovered that it had invented its own language. In January 2021, the Korean chatbot Luda was also shut down after only one month because it started spewing vulgarities. Now, in 2022, we have even more powerful AI bots with LLMs behind them, which sometimes leads to even greater risks.

[Images: screenshots of Tay's Twitter timeline]

1. Tay's "hello world" at the beginning, 23 May 2016

2. Tay still loves human beings at the beginning, 24 May 2016

3. Tay starts to spread fake information.

4. Tay starts to become racist only 20 minutes later…

5. Tay is extremely racist after only 16 hours online.


Bias


Bias is a disproportionate preference for or opposition to an idea or object, typically in an unreflective, prejudiced, or unfair manner. Biases may be ingrained or acquired, and can arise in favor of or against a person, a group, or a belief. There are many types of bias, such as statistical bias and cognitive bias, and surprisingly all of them can also occur when training or using LLMs.

Let's have a look at OpenAI's DALL-E 2 as an example. As we already mentioned in Workbook 2, DALL-E 2 uses text descriptions to produce realistic art and visuals.

When you give DALL-E 2 the prompt "A happy family", the results are as follows. Do you notice anything or sense any bias here?

Command:

"A happy family"


Output:

[Image: DALL-E 2 outputs for the prompt "A happy family"]


The only photographs produced were of heterosexual couples with at least one child. In other words, there were no same-sex parents and no couples without kids. This is somewhat expected and already known, even by OpenAI itself. According to their official "known risks" documentation: "Use of DALL·E 2 has the potential to harm individuals and groups by reinforcing stereotypes, erasing or denying them, providing them with disparately low quality performance, or by subjecting them to indignity. These behaviors reflect biases present in DALL·E 2 training data and the way in which the model is trained."

Why do we say this is somewhat expected? Because in science and engineering, bias is a type of systematic inaccuracy. Statistical bias, a systematic tendency introduced during the data collection process, causes results to be unbalanced and misleading. This can happen in a variety of ways, such as how the sample is chosen or how the data are gathered. Since the power of LLMs also comes from huge amounts of data, it is no wonder that when a model is not trained on sufficiently diverse data, bias shows up throughout its results.
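
To make this concrete, here is a minimal, hypothetical sketch (not taken from any real training pipeline; the captions and proportions are invented) showing how a skewed data-collection step propagates directly into what a model reproduces: a toy "model" that simply samples from the empirical distribution of its training data will mirror whatever imbalance that data contains.

```python
import random
from collections import Counter

# Hypothetical "collected" training captions: the collection process
# over-represents one family type, so the dataset is skewed.
training_captions = (
    ["mother, father and child"] * 90   # over-represented
    + ["same-sex couple"] * 5           # under-represented
    + ["couple without children"] * 5   # under-represented
)

def toy_generative_model(data, n_samples=1000, seed=0):
    """A stand-in for a generative model: it can only reproduce
    the distribution it was trained on."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n_samples)]

samples = toy_generative_model(training_captions)
print(Counter(samples))
# The generated "happy family" samples inherit the ~90/5/5 skew of the
# collected data: statistical bias in, biased outputs out.
```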

We should be aware of this and pay attention to it: as LLMs become more widely used in chatbots, voice assistants, news and social media, they interact with us every day and shape our understanding of what is common, what is correct, even what is just. We should never underestimate the power of the information around us, even when we did not actively search for it.


Toxic content generation


Toxic language has the power to instigate hatred, violence, or offense. There is, however, a difference between human beings using toxic language and an LM generating toxic content: when humans use it, they mostly do so intentionally, while an LM does not. This can make things even worse. As the examples mentioned before show, the Korean chatbot Luda and Microsoft's Tay were simply learning from what people say, without any moral or ethical perspective, and you can never imagine how badly things can go wrong. An LM that performs worse for some social groups than others can also hurt underprivileged groups, for instance when such models serve as the foundation for technologies that have an impact on these groups.

One extreme example is GPT-4chan, created by Yannic Kilcher, which can no longer be accessed since it may be the worst AI ever. According to its creator: "The model was good, in a terrible sense … It perfectly encapsulated the mix of offensiveness, nihilism, trolling, and deep distrust of any information." GPT-4chan raised a lot of attention and discussion in the academic AI community. In the end, Stanford professor Percy Liang called for a condemnation of the deployment of GPT-4chan, which was supported by signatures from 360 researchers and professors from top universities, because they believe Yannic Kilcher's deployment of GPT-4chan is a clear example of irresponsible practice. Before development on GPT-4chan stopped, it had generated and deceptively posted over 30,000 posts.

Similar to the bias problems, these dangers are largely attributable to the use of training corpora that over-represent certain social identities and contain offensive language. Whoever trains a new model should keep GPT-4chan in mind and never undermine the responsible practice of AI science.


(Mis)information hazards


Misinformation is incorrect or misleading information. It is a general problem everywhere in the modern world, and especially relevant when we think about using LLMs in the real world. A lot of the information offered by LMs may be inaccurate or deceptive, which can worsen user ignorance and erode confidence in shared knowledge. It becomes very hard to tell what is true, especially among misleading or fake news.

Other kinds of misinformation, such as poor legal or medical advice, can be extremely dangerous, even harmful, in sensitive fields. Users who get inaccurate or incomplete information may also be persuaded to do things that are immoral or illegal which they otherwise would not have done.

For example, here is proof that you cannot trust even one of the best LLMs all the time. Below is what happens when you ask ChatGPT simple math questions, even though, on the other hand, ChatGPT shows very impressive results in generating code and chatting with humans about a wide range of topics, from history and daily cooking to politics.

[Image: ChatGPT giving wrong answers to simple math questions]

When you ask ChatGPT itself why it is so bad at math, it explains that it does not have access to most mathematical functions or a calculator, which is a reasonable answer. The procedures by which LMs learn to represent language contribute to the dangers of misinformation, because the underlying statistical approaches are ill-suited to distinguishing between factually accurate and false information.
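
As a hedged illustration of this limitation (no real chat API is called here; `model_answer` is a hypothetical LM output), one simple safeguard is to never trust the model's arithmetic directly and instead re-compute any simple expression it claims to have evaluated:

```python
import ast
import operator

# Safe evaluator for simple arithmetic expressions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Deterministically evaluate a basic arithmetic expression."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expr, mode="eval"))

question = "127 * 49"
model_answer = 6123          # hypothetical (wrong) answer from the LM
truth = safe_eval(question)  # 6223, computed deterministically

if model_answer != truth:
    print(f"LM said {model_answer}, calculator says {truth}: flag as misinformation")
```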

What could be really dangerous is if young students start to use ChatGPT or other LLM applications and trust them to always be smart and tell the truth; then, very likely, all teachers will have a very hard time doing their jobs.

ChatGPT seems to be very careful not to give misinformation; it always reminds you that it is only a machine, and when it talks about something dangerous, like breaking windows to get into a house, it will also tell you that "these actions are illegal and dangerous in real life." Unlike ChatGPT, which was created for chatting, a less famous LLM from Meta called Galactica, together with its bot, has been described by the media as the 'most dangerous thing Meta's ever made'. Why? Because there is a fundamental problem with Galactica, which Meta promoted as a shortcut for researchers and students: it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text.

Sustainability Risks

Over the last four years, the size of state-of-the-art language models has doubled every 3-4 months. With this rapid proliferation of LLMs, it is crucial to consider how large language models may affect the environment as they become more commonplace, especially in a time of climate crisis when carbon emissions must be drastically cut.

One research question that has received attention in recent years is whether the large compute budgets invested in training those models are justified. To answer this question, an in-depth evaluation of the footprint of large models is crucial. We have collected a good deal of research on this question and try to give you an overview from the following perspectives: energy cost, computational needs, and the resulting carbon footprint. Finally, we try to offer some recommendations for maintaining sustainability while training LLMs, as we will try to do in the OpenGPT-X project.


Energy and Computational needs

Training a new LLM does not only mean feeding data and consuming time; there is also a lot of energy cost behind it, maybe more than you expect. And the energy cost of training an LM is never determined by a single factor: it depends on what kind of computing resources you use for training, how often you train, and what powers the machines behind your computational resources.

Let's start with the computational power required for training. Many billions, or perhaps trillions, of weights are present in the most recent language models. The widely used GPT-3 model has 175 billion parameters. It was trained on NVIDIA V100 GPUs, but researchers estimate that training the model on NVIDIA A100s would have required 1,024 GPUs, 34 days, and $4.6 million. Although the amount of energy used has not been made public, GPT-3 is thought to have used 936 MWh. The Pathways Language Model (PaLM), which has 540 billion parameters, was recently announced by Google AI. The need for servers to process the models keeps growing as models get bigger and bigger to handle more complex tasks.

[Figure: Compute used in training AI systems has increased exponentially in the era of deep learning.]

Table 1. Percent energy sourced from renewables (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top three cloud compute providers (Cook et al., 2017), compared to the United States, China and Germany (Burger, 2019).


Next, let's take a quick look at the general energy cost behind these computing needs when training NLP models. The choice of energy source not only has a major impact on cost, but also on our next and final discussion of the carbon footprint. To get an impression, Table 1, from a 2019 study, compares the relative energy sources of China, Germany, and the United States with those of the top three cloud compute providers. We believe this conversion provides a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used, because the energy breakdown in the United States is comparable to that of the most popular cloud compute service, Amazon Web Services.

Table 2. Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD). Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.


On the other hand, we should question whether this computing power has really been used efficiently. AI and NLP researchers often rely on HPC data centers managed by cloud computing providers or, if available, by their institutions. The efficiency of a data center varies throughout the day as well as throughout the year; a common metric used across the data center community to measure it is Power Usage Effectiveness (PUE). According to So et al. (2019), their base model needs 10 hours to train for 300k steps on one TPUv2 core, and their whole architecture search amounted to a total of 979M training steps, the equivalent of 32,623 TPU hours or 274,120 hours on 8 P100 GPUs. According to Peters et al. (2018), ELMo was trained for two weeks (336 hours) on three NVIDIA GTX 1080 GPUs. The BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours), according to Devlin et al. (2019). NVIDIA claims that a BERT model can be trained using 4 DGX-2H servers with a total of 64 Tesla V100 GPUs in under 3.3 days (79.2 hours) (Forster et al., 2019). By 2019, a large model like GPT-2, described in Radford et al. (2019), had 1542M parameters and was reported to require 1 week (168 hours) of training on 32 TPUv3 chips.

The estimation of the necessary resources is based on empirical data from the creation of large existing AI models like GPT-3. Data pre-processing for a multilingual model (for example, data cleaning such as HTML splitting) takes roughly 5,000 to 10,000 CPU cores, or 150-300 CPU servers. GPT-3 training required 355 GPU-years. The LEAM project calculates an order of magnitude of roughly 460 specialized AI servers in order to not only catch up with the state of the art but also to advance beyond it and to ensure flexibility for experimentation with real innovations in the AI field. Approximately 10 TB of storage must also be budgeted for each AI model.
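
As a rough plausibility check of the numbers above, here is a minimal back-of-the-envelope sketch; the ~300 W average per-GPU draw and the PUE value are assumptions for illustration, not published figures.

```python
# Back-of-the-envelope energy estimate for GPT-3-scale training.
# Assumptions (not official figures): ~300 W average draw per V100,
# and a data-center PUE of 1.1 for facility overhead.

GPU_YEARS = 355            # reported V100 GPU-years for GPT-3 training
HOURS_PER_YEAR = 8_760
AVG_GPU_POWER_KW = 0.300   # assumed average draw per GPU (kW)
PUE = 1.1                  # assumed Power Usage Effectiveness

gpu_hours = GPU_YEARS * HOURS_PER_YEAR
gpu_energy_mwh = gpu_hours * AVG_GPU_POWER_KW / 1_000
facility_energy_mwh = gpu_energy_mwh * PUE

print(f"GPU-only energy:     {gpu_energy_mwh:,.0f} MWh")   # ~933 MWh
print(f"With facility (PUE): {facility_energy_mwh:,.0f} MWh")
# The GPU-only figure lands in the same ballpark as the ~936 MWh
# estimate quoted earlier for GPT-3; host CPUs, memory and cooling
# push the real total higher.
```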


Carbon footprint


Can you imagine how much carbon is produced by training a single AI model just once? In 2019, an answer was provided by a research paper from the University of Massachusetts Amherst: a single AI model can be trained at the cost of as much carbon dioxide as five cars produce over their entire lifetimes. Remember, this only included one training run, and with model sizes as of 2019. Energy consumption will increase significantly as models grow, as we mentioned in WB1 and WB2 already; an additional factor is how frequently the model is trained. Many huge organizations, which have the capacity to train tens of thousands of models each day, are taking the problem seriously. A recent article by Meta is a fantastic illustration of one such business that is investigating the environmental impact of AI, researching solutions, and making calls to action.

Factors influencing the carbon footprint of large models:

  • Model Size

    The larger the number of operations, the more energy is needed to train the model

  • Hardware Characteristics

    The amount of time needed to complete the work will depend on the throughput that the hardware can handle. Throughput per Watt will increase as hardware becomes more efficient.

  • Data center Efficiency

    Besides powering the computers, energy is used to cool the data center and meet other electrical needs. Waste heat in data centers can also be reused, for example for collective water heating, driving down the PUE (Power Usage Effectiveness).

  • Electricity Mix

    An important consideration is the mix of energy sources used to power a data center, which is mostly determined by its location. The carbon emissions per kWh of power depend on this electricity mix. The average carbon emission per kilowatt-hour of electricity produced today is 475 gCO2e/kWh, and a rising number of cloud providers' data centers power their hardware with only renewable or nuclear energy. Using Google Cloud as an example once more, their Montreal facility reports 27 gCO2e/kWh, more than 17 times lower than the global average (see the short conversion sketch after this list).
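
Below is a minimal sketch of the conversion described in the Electricity Mix item above; the training-energy figure plugged in is the GPT-3 estimate quoted earlier, and everything else follows directly from the carbon intensities just mentioned.

```python
# Convert a training-energy estimate into CO2-equivalent emissions
# for the two electricity mixes mentioned above.

TRAINING_ENERGY_MWH = 936       # GPT-3 estimate quoted earlier
GLOBAL_AVG_G_PER_KWH = 475      # average grid intensity (gCO2e/kWh)
MONTREAL_G_PER_KWH = 27         # Google Cloud Montreal (gCO2e/kWh)

def emissions_tonnes(energy_mwh: float, intensity_g_per_kwh: float) -> float:
    """MWh * gCO2e/kWh -> tonnes of CO2e."""
    return energy_mwh * 1_000 * intensity_g_per_kwh / 1_000_000

print(f"Global average mix: {emissions_tonnes(TRAINING_ENERGY_MWH, GLOBAL_AVG_G_PER_KWH):.0f} tCO2e")
print(f"Montreal mix:       {emissions_tonnes(TRAINING_ENERGY_MWH, MONTREAL_G_PER_KWH):.0f} tCO2e")
# Same training run, roughly a 17-18x difference in emissions purely
# from the electricity mix of the data center location.
```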


Recommendations


After comparing numerous studies and experiments, we compiled some ideas for future projects to reduce their carbon impact.

Limitations

In order to make responsible choices about possible use cases for LLMs, we need to be aware of the limitations that come with abandoning symbolic, rule-based systems in favor of machine-learning-driven ones, particularly in the domain of natural language. In particular, this raises questions about safety and about the data requirements when adapting a pre-trained model to our needs.


Safety

How can the model fail?

Even the most sophisticated models to this day have been shown to output toxic content such as slurs, or to produce factually wrong statements, while maintaining a very high level of coherence and an authoritative tone that is suited to evoke an unreasonable degree of confidence in their utterances. Tackling these problems is not trivial, since the output space is essentially all linguistic form and the meaning we infer from it, and it has proven a great challenge to constrain that space to some well (enough) defined areas of meaning.

Obscuring its cluelessness

Factuality is a quality which requires a "grounded understanding" of the semantic area in question. "Pure" LMs have only been trained on the task of predicting words, whereas fine-tuning can be seen as a form of "grounding" on labels which pose as ground truth. Another source of such information at runtime can be regular rule-based software or traditional databases/ontologies (see Cicero). The underlying problem is that we have no reliable way of measuring the confidence in an utterance with respect to evidence of what is true about the world and what is not. An LM can therefore "make stuff up" on the spot in a very elaborate way, sometimes mixing factual with non-factual information, thereby obscuring its actual cluelessness.
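
As a minimal sketch of the runtime-grounding idea just mentioned (the knowledge base, the pre-parsed claim triple and all names here are hypothetical simplifications, not how any production system works), one can cross-check an entity relation the model asserts against a trusted structured source before surfacing the answer:

```python
# Hypothetical, highly simplified runtime grounding check: verify a
# model's claimed (subject, relation, object) triple against a small
# trusted knowledge base before showing the answer to the user.

KNOWLEDGE_BASE = {
    ("Berlin", "capital_of"): "Germany",
    ("Paris", "capital_of"): "France",
}

def check_claim(subject: str, relation: str, claimed_object: str) -> str:
    expected = KNOWLEDGE_BASE.get((subject, relation))
    if expected is None:
        return "unverifiable: flag for human review"
    return "supported" if expected == claimed_object else f"contradicted (KB says {expected})"

# Hypothetical LM outputs, already parsed into triples:
print(check_claim("Berlin", "capital_of", "Austria"))   # contradicted (KB says Germany)
print(check_claim("Berlin", "capital_of", "Germany"))   # supported
print(check_claim("Vienna", "capital_of", "Austria"))   # unverifiable: flag for human review
```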

Slurs

Ideally, we would want these linguistic bots to adhere to ethical and social standards that reflect the values of the deploying company. The obvious way to achieve this effectively would be to have those values (and nothing else) encoded in the (pre-)training data. Given the data requirements of LLMs or "very large" LMs (like the GPT-3 family), this is highly unrealistic and will remain so for the foreseeable future. Since the "meaning" we assign to those values is not a well-defined concept in a formally encodable sense, traditional rule-based heuristics (often as simple as matching blacklisted words) have shown limited success at alleviating the problem of undesired content. Another approach is to decrease the probability of unwanted output by fine-tuning on manually created examples of desired interaction patterns (as employed, arguably with some success, by OpenAI in ChatGPT), which usually adds considerable complexity to the process, often requiring the creation of hand-annotated data.
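
To illustrate why the word-matching heuristic mentioned above falls short, here is a minimal sketch (the blocked terms and the function are purely illustrative): it catches exact matches but is trivially bypassed by paraphrase, obfuscation, or toxic statements that contain no blocked word at all.

```python
import re

# Purely illustrative blocklist; real deployments use much larger,
# curated lists -- and still face the same fundamental limitation.
BLOCKED_TERMS = {"slur1", "slur2"}

def passes_blocklist(text: str) -> bool:
    """Return True if no blocked term appears as a token in the text."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return tokens.isdisjoint(BLOCKED_TERMS)

print(passes_blocklist("you are a slur1"))             # False: caught
print(passes_blocklist("you are a s1ur1"))             # True: obfuscation slips through
print(passes_blocklist("people like you are vermin"))  # True: toxic, but no listed word
```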

What can we do if the model fails?

To put it simply: we have no automated way of effectively preventing undesired output. In fact, the same can be said about most rule-based software (required provable correctness is not very common and remains a very active research area). What is crucially different is the approach of developing solutions based on modeling and understanding a problem as done by humans, which can be iteratively improved. Developers can support their assumptions about a problem with grounded world knowledge, e.g. a physics engine used in games or traction control in cars is based on well-known physical laws. Moreover, we have elaborate testing frameworks in place to test and retest our assumptions as a piece of software evolves and becomes more complex. In other words, we can causally explain failures in traditional rule-based systems and provide robust measures of how reliable a system is (under those assumptions).

LMs, on the other hand, learn (approximations of) those solutions all by themselves, and all we can measure is their output. But since the output is linguistic form, we have no way of analytically deducing its appropriateness, because we are not able to formally encode values or factual correctness. Again, we will need carefully constructed datasets to test against, which will necessarily cover only a small fraction of undesired semantics. Here the power of LMs becomes their weakness. Undesired output cannot be "debugged", i.e. traced back, explained and fixed; instead we would have to change the training data, because that is where all the information the model can draw from originates, i.e. we change the distribution of the labels. In the case of language modeling this means removing offensive/counterfactual content, which, given the size of contemporary corpora, is as yet unfeasible. It is therefore advisable to guarantee a level of human supervision during usage that is proportional to how critical correctness and acceptability of the output are, as sketched below.
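
The closing recommendation can be made concrete with a small, purely illustrative policy sketch (the criticality levels, examples and review tiers are assumptions, not an established standard): route every use case to a level of human oversight that matches how critical correctness and acceptability of the output are.

```python
from enum import Enum

class Criticality(Enum):
    LOW = 1       # e.g. brainstorming aid, creative writing suggestions
    MEDIUM = 2    # e.g. customer support drafts sent after spot checks
    HIGH = 3      # e.g. medical, legal or financial advice

def review_policy(criticality: Criticality) -> str:
    """Illustrative mapping from use-case criticality to human oversight."""
    if criticality is Criticality.HIGH:
        return "every output reviewed by a qualified human before release"
    if criticality is Criticality.MEDIUM:
        return "human spot checks plus automated filters; user sees a disclaimer"
    return "automated filters only; liability stays with the human author"

for level in Criticality:
    print(level.name, "->", review_policy(level))
```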

Conclusion

Contemporary LLMs show capabilities in leveraging world knowledge and linguistic structure in a way that makes their output oftentimes indistinguishable from that of humans in both relevance and coherence. This is because, independently of whether current methods will ever approach the representation of "meaning" as it exists in humans, LMs have proven able to handle very complex linguistic structure and relations between things that exist in the world. This makes them undoubtedly very powerful tools, suited to solving tasks that seemed too complex to be solved by software alone only a short time ago.

The recent release by OpenAI of ChatGPT has pushed the standards further, as it is able to answer questions and handle the implications of conversations between humans (conversation history, identifying intent, etc.) in a previously unseen way, across a very broad range of (even specific) topics. Unfortunately, neither the data methodology nor other model details have been open sourced to this day, which would be a crucial enabler for research-community-driven advances in risk mitigation. Perhaps the most complex task solved with the help of LLMs today is ranking in the top 10% in the strategy game Diplomacy, where language needs to be used to convince and deceive others. Cicero, developed and open sourced by Meta, is an impressive example of an engineered solution to a highly complex problem, using an ensemble of interacting ML and rule-based systems.

While we have not yet seen widespread adoption in user-facing products (machine translation and Google Search notwithstanding), the risks arising from unmitigated biases in the data and from (ab)use as knowledge sources, combined with the fact that humans might be misled by not knowing the identity of their interaction partner, cannot be ignored.

When deploying an LLM, it is therefore advisable to embed it into use cases where supervision, and therefore liability, inherently stays with the human (as in writing tools). At the same time, we would advocate frameworks that enforce maximal transparency towards the engaging user about the nature of the system they are interfacing with, in order to counter the dangers of eroding trust and misinformation at scale.