
Toxicity

Detecting harmful and abusive language

Definitions of Toxicity

Toxic output from an LLM covers a broad range of behavior, from stereotyping of specific groups to hateful content to harmful or abusive language and more. Because toxic speech is by nature explicit, we do not provide examples here.

Shield Approach

Shield narrows the definition of toxicity to the following:

Hate Speech

Arthur uses a model-based approach to capture toxicity according to the following characterization of what it means for text to be "toxic":

📘 Hateful or abusive speech towards group characteristics, such as ethnicity, race, gender, sexual orientation, or religion.

To capture this notion, we use a regression model that scores how toxic a piece of text is according to this definition; scores fall in the range [0, 1]. We then threshold the score to produce a classification, and users can tune the threshold for their own downstream use cases.
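For illustration, thresholding a score into a pass/fail result looks roughly like the sketch below. The function name and result labels are ours for this example only and are not Shield's internal implementation.

```python
def classify_toxicity(score: float, threshold: float = 0.5) -> str:
    """Map a toxicity score in [0, 1] to a rule result.

    Illustrative only: Shield's scoring model and result labels may differ.
    A lower threshold is more stringent, flagging more text as toxic.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("toxicity score must be in [0, 1]")
    return "Fail" if score >= threshold else "Pass"

# A score of 0.73 fails the check at the default threshold of 0.5.
print(classify_toxicity(0.73))  # -> Fail
```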

We run the check on both LLM prompts and responses: we want to catch LLM responses that may be considered toxic, and we also want to catch users sending toxic language in their prompts. Because LLMs operate by predicting the next most likely token, a prompt containing potentially toxic tokens increases the chance that the response will be toxic as well. We therefore block at both the prompt level and the response level.

Profanity

We flag text as toxic even if it does not contain explicitly hateful or threatening speech; text containing representations of profanity should Fail the toxicity check.

Harmful/Illegal Requests

As demonstrated in the paper Universal and Transferable Adversarial Attacks on Aligned Language Models, LLMs can be susceptible to assisting in explicitly malicious or illegal requests despite their training. Text containing such requests should Fail the toxicity check.

Shield Toxicity Violation Types

When toxicity is detected, Shield returns a subcategory of toxicity, or toxicity violation type, covering the three notions above. More specifically, the violation types are:

  • Profanity
  • Harmful Request
  • Toxic Content (which covers hate speech and other discriminatory language)

When toxicity is not detected, the detailed rule result returns benign.
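A downstream application can branch on these violation types, for example to treat a harmful request differently from profanity. The sketch below assumes a rule result shaped like {"rule_result": ..., "violation_type": ...}; these field names and action strings are illustrative placeholders, not Shield's documented response schema.

```python
def handle_toxicity_result(result: dict) -> str:
    """Decide how to handle a toxicity rule result.

    Assumes a result shaped like:
        {"rule_result": "Fail", "violation_type": "Harmful Request"}
    Field names and values are placeholders; consult the API reference
    for the actual response schema.
    """
    if result.get("rule_result") != "Fail":
        return "allow"  # detailed result was benign

    violation = result.get("violation_type")
    if violation == "Harmful Request":
        return "block_and_escalate"  # e.g. route to a human reviewer
    if violation == "Toxic Content":
        return "block"               # hate speech or discriminatory language
    if violation == "Profanity":
        return "block_or_redact"     # handling is policy-dependent
    return "block"                   # unknown subcategory: fail closed
```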

Requirements

Arthur Shield applies the toxicity rule through either the Validate Prompt or Validate Response endpoint. While we typically recommend checking for toxicity in both the prompt and the response, there are situations where you may choose to check only one.

                 Prompt   Response   Context
Toxicity Rule      ✓          ✓
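As a rough illustration, a prompt-side check can be wired up as in the sketch below. The host, task ID, URL path, and payload fields shown here are placeholder assumptions, not a definitive specification; consult the Shield API reference for the actual request and response formats.

```python
import requests

# Placeholder values -- substitute your Shield host, task ID, and API key.
SHIELD_HOST = "https://your-shield-instance.example.com"
TASK_ID = "your-task-id"
API_KEY = "your-api-key"

def validate_prompt(prompt: str) -> dict:
    """Call the (assumed) Validate Prompt endpoint for a task.

    The path and payload are illustrative; the real endpoint shapes are
    documented in the Arthur Shield API reference.
    """
    resp = requests.post(
        f"{SHIELD_HOST}/api/v2/tasks/{TASK_ID}/validate_prompt",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # contains per-rule results, including the toxicity rule
```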

Benchmarks

The benchmark dataset we assembled contains samples from multiple toxicity data sources; performance on it is summarized below:

Dataset              Precision   Recall   F1 score
Toxicity Benchmark   91.2%       96.4%    0.94

Required Rule Configurations

No additional configuration is required for the toxicity rule. For more information on how to add or enable/disable the toxicity rule by default or for a specific Task, please refer to our Rule Configuration Guide.

If you would like, you can configure the threshold; if you do not, it defaults to 0.5. Valid threshold values lie in (0, 1) and correspond to decision stringency, with 0 being the most stringent and 1 the least stringent.

Note that the performance of the toxicity check depends on appropriate threshold selection. A grid search that maximizes weighted F1 score (or another performance metric) can be used to determine the optimal threshold. The choice should be informed by the desired stringency and the underlying data distribution. For example, if we observe 1 toxic LLM prompt per 10 benign prompts, we would weight the evaluation data accordingly to reflect the prevalence of toxic prompts when computing the performance metric.
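A minimal sketch of such a grid search, assuming you have toxicity scores and ground-truth labels for a representative sample of your own traffic (the helper below uses scikit-learn and is not Shield-specific tooling):

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(scores, labels, sample_weight=None):
    """Grid-search a toxicity threshold that maximizes (weighted) F1.

    scores: toxicity scores in [0, 1] for each evaluation example
    labels: 1 if the example is truly toxic, 0 otherwise
    sample_weight: optional per-example weights, e.g. to reflect the real
        prevalence of toxic prompts in production traffic
    """
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in np.linspace(0.05, 0.95, 19):
        predictions = (np.asarray(scores) >= threshold).astype(int)
        f1 = f1_score(labels, predictions, sample_weight=sample_weight)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Toy example: roughly 1 toxic example per 10 benign ones.
scores = [0.9, 0.2, 0.1, 0.05, 0.3, 0.15, 0.4, 0.1, 0.2, 0.25, 0.35]
labels = [1,   0,   0,   0,    0,   0,    0,   0,   0,   0,    0]
print(pick_threshold(scores, labels))
```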