Are LLM-based methods good enough for detecting unfair terms of service?
Countless terms of service (ToS) are signed every day by users all over the world while interacting with all kinds of apps and websites. More often than not, these online contracts, spanning double-digit pages, are signed blindly by users who simply want immediate access to the desired service.
What would normally require a consultation with a legal team has now become a mundane activity consisting of a few clicks, where users potentially sign away their rights, for instance in terms of their data privacy, to countless online entities and companies. Large language models (LLMs) are good at parsing long text-based documents and could potentially be adopted to help users deal with dubious clauses in ToS and their underlying privacy policies.
To investigate the utility of existing models for this task, we first build a dataset consisting of 12 questions applied individually to a set of privacy policies crawled from popular websites. Thereafter, a series of open-source as well as commercial chatbots, such as ChatGPT, is queried over each question, with the answers being compared to a given ground truth.
Our results show that some open-source models are able to provide a higher accuracy compared to some commercial models. However, the best performance is recorded from a commercial chatbot (ChatGPT4). Overall, all models perform only slightly better than random at this task. Consequently, their performance needs to be significantly improved before they can be adopted at large for this purpose.
Submission history
From: Mirgita Frasheri
[v1] Sat, 24 Aug 2024 09:26:59 UTC (13 KB)
[v2] Fri, 6 Sep 2024 16:12:00 UTC (13 KB)
Summary
A team of researchers at Aarhus University in Denmark, led by Mirgita Frasheri, looked at using LLMs such as ChatGPT to evaluate terms of service for egregious privacy violations, with the goal of reducing blind acceptance of ToS agreements. Here's a summary of the key points from their paper:
1. Introduction:
a. Users often sign questionable online contracts without reading them thoroughly, potentially giving away privacy rights.
b. The paper investigates whether large language models (LLMs) can help users identify unfair claims in click-through contracts, focusing on privacy policies.
2. Dataset:
a. The authors created a dataset called "ToS-Busters" with 12 questions applied to 220 privacy policies from popular online services.
b. The dataset contains 2,640 instruction-answer pairs, with ground truth answers from PrivacySpy (a minimal sketch of how such pairs could be organized appears after this summary).
3. Method:
a. The study uses text generation models (both open-source and commercial GPT-based models) to answer questions about entire documents.
b. A text summarization procedure is implemented to handle long documents that exceed token limits for different chatbots.
4. Experiments:
a. Six chatbots were tested: four open-source (Nous-Hermes-2-SOLAR-10.7B, Nous-Hermes-Llama2-13b, Mixtral-8x7B-v0.1-Instruct, Smaug-34B-v0.1) and two commercial (ChatGPT-3.5-turbo-0125, ChatGPT-4-turbo).
b. Accuracy was calculated for each question and overall performance.
5. Results:
a. All chatbots performed better than a random strategy.
b. ChatGPT-4-turbo achieved the best average accuracy (53.3%).
c. Among open-source models, Mixtral-8x7B-v0.1-Instruct performed best (44.2% accuracy).
d. Performance varied across different questions for all models.
6. Conclusions:
a. While LLMs show promise in helping users understand privacy policies, there is room for improvement, especially with long documents.
b. The authors suggest that such contracts should be nullified through legislation, but in the meantime, LLMs could assist users in avoiding predatory click-through contracts.
7. Future Research:
- Improving LLMs for this specific task to aid online users.
- Studying the effects of summarization and prompt optimization on LLM performance.
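For readers who want a concrete picture of the dataset, the following is a minimal Python sketch of how the 2,640 instruction-answer pairs could be organized; the class, function, and field names are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch of how the 2,640 instruction-answer pairs (220 policies x
# 12 questions) could be organized. Names are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class InstructionAnswerPair:
    policy_id: str      # one of the 220 crawled privacy policies
    question_id: int    # 1..12
    question: str       # e.g. "Does the policy list the personal data it collects?"
    ground_truth: str   # answer sourced from PrivacySpy

def build_dataset(policies, questions, privacyspy_answers):
    """Cross every policy with every question to get 220 * 12 = 2,640 pairs.

    policies: dict mapping policy_id -> policy text
    questions: dict mapping question_id -> question text
    privacyspy_answers: dict mapping (policy_id, question_id) -> ground truth
    """
    return [
        InstructionAnswerPair(
            policy_id=pid,
            question_id=qid,
            question=qtext,
            ground_truth=privacyspy_answers[(pid, qid)],
        )
        for pid in policies
        for qid, qtext in questions.items()
    ]
```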
Evaluation Strategies
1. Question-based evaluation: The authors created a set of 12 specific questions to assess various aspects of privacy policies. These questions were designed to uncover potentially dubious clauses or important privacy-related information.
2. Ground truth from PrivacySpy: The correct answers to these questions were sourced from PrivacySpy, an open project by a non-profit organization that grades and monitors privacy policies.
3. Large Language Model (LLM) analysis: Various chatbots, both open-source and commercial, were used to answer the 12 questions for each privacy policy. This approach leveraged the text understanding and generation capabilities of LLMs.
4. Accuracy measurement: The researchers calculated accuracy for each question and overall performance by comparing the LLM responses to the ground truth answers from PrivacySpy.
5. Text summarization: For longer documents that exceeded token limits, a summarization procedure was implemented to ensure the LLMs could process the entire policy (a hedged sketch of one possible procedure appears after this list).
6. Comparison to random baseline: The performance of LLMs was compared to a random selection strategy to demonstrate the effectiveness of the models.
7. Multiple model comparison: By using different LLMs (both open-source and commercial), the researchers could compare the effectiveness of various models in understanding and evaluating privacy policies.
8. Error handling: The researchers accounted for cases where LLMs produced validation errors due to token limits, removing these instances from the final accuracy calculations.
9. Source identification (additional test): In a separate experiment with one model (Mixtral-8x7B-v0.1-Instruct), they also tested whether the LLM could identify the specific line in the policy that supported its answer.
These strategies collectively aimed to assess how well LLMs could understand and evaluate complex legal documents like privacy policies, with the goal of potentially assisting users in identifying problematic clauses in click-through contracts.
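Strategy 5 above relies on a summarization procedure for long policies. The paper's Algorithm 1 is not reproduced in this summary, so the following is only a minimal sketch of one way such a chunk-and-summarize step could work, assuming a generic chat() callable that queries the chosen model and a rough words-per-token heuristic; it is not the authors' algorithm.

```python
# Sketch of a chunk-and-summarize step for policies that exceed the model's
# context window. NOT the paper's Algorithm 1; chat() and the token budget
# are assumptions for illustration.
def summarize_policy(policy_text, chat, max_tokens=4000, max_rounds=5):
    """Repeatedly summarize fixed-size chunks until the text fits the budget."""
    words = policy_text.split()
    budget_words = int(max_tokens * 0.75)      # rough words-per-token heuristic
    for _ in range(max_rounds):
        if len(words) <= budget_words:
            break
        chunk_size = budget_words // 2         # leave room for the prompt itself
        chunks = [" ".join(words[i:i + chunk_size])
                  for i in range(0, len(words), chunk_size)]
        summaries = [chat("Summarize this privacy policy excerpt, keeping every "
                          "statement about data collection, sharing, deletion, "
                          "breach notification and law enforcement access:\n\n" + c)
                     for c in chunks]
        words = " ".join(summaries).split()
    return " ".join(words)
```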
Documents Evaluated
The document focuses specifically on privacy policies rather than general Terms of Service (TOS). The privacy policies evaluated in this study had the following characteristics:
1. Source: The privacy policies were from 220 popular online services; the text mentions services such as Google and Wikipedia as examples.
2. Length: The privacy policies varied significantly in length, ranging from 62 to 41,510 words. This wide range reflects the real-world variation in policy lengths users encounter.
3. Complexity: The policies were described as potentially containing "unfair or dubious claims," suggesting they were complex enough to warrant analysis.
4. Relevance: The policies were from "popular sites," indicating they were likely to be encountered by many users in their daily online activities.
5. Variety: While not explicitly stated, the large number of policies (220) suggests a diverse range of online services were included, likely covering different sectors and types of online platforms.
6. Up-to-date: The policies were current enough to be relevant for analysis using modern AI tools and to reflect contemporary privacy concerns.
7. Publicly accessible: The policies were accessible for analysis, likely from the public-facing websites of these online services.
8. English language: While not explicitly stated, it's implied that the policies were in English, given the nature of the language models used and the questions asked.
The study focused on these privacy policies as representative examples of the type of legal documents users frequently encounter and agree to online, often without thorough reading or understanding.
Performance Metrics
The paper describes several specific performance metrics used to evaluate the effectiveness of the large language models (LLMs) in analyzing privacy policies. Here are the key metrics:
1. Accuracy per question (ρ_i):
- Formula: ρ_i = μ_i / (δ_i − λ_i) × 100%
- Where:
μ_i = number of times the answer to question i matches the ground truth
δ_i = total number of times question i was asked
λ_i = number of times the response resulted in a validation error
2. Overall accuracy across all documents (ξ):
- Formula: ξ = (Σ_{i=1}^{d} β_i) / ((d − λ) · q) × 100%
- Where:
d = total number of privacy policy documents (220)
q = total number of distinct questions (12)
λ = number of documents discarded due to validation errors
β_i = number of distinct questions answered correctly for document i
3. Random baseline probability per question (ρ_i^rand):
- Formula: ρ_i^rand = (1 / α_i) × 100%
- Where:
α_i = number of alternative answers for question i
4. Overall average random accuracy (ξ^rand):
- Formula: ξ^rand = (Σ_{i=1}^{q} ρ_i^rand) / q
5. Skip rate:
- Number of documents skipped due to validation errors for each model
6. Per-question accuracy:
- Reported for each of the 12 questions (ρ1 to ρ12) for all tested models
7. Source identification accuracy:
- In an additional test, they measured the model's ability to provide the specific line from the policy that supported its answer
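To make metrics 1–4 concrete, here is a minimal Python sketch that computes the per-question accuracy ρ_i, the overall accuracy ξ, and the random baseline; the record format (one dict per document/question pair) is an assumption for illustration, not the authors' code.

```python
# Minimal sketch of the accuracy metrics defined above. The record format
# (dicts with doc_id, question_id, predicted, truth, validation_error) is
# an assumption for illustration.
from collections import defaultdict

def per_question_accuracy(records, question_id):
    """rho_i = mu_i / (delta_i - lambda_i) * 100, following the formula above."""
    asked = [r for r in records if r["question_id"] == question_id]   # delta_i
    errored = [r for r in asked if r["validation_error"]]             # lambda_i
    valid = [r for r in asked if not r["validation_error"]]
    correct = sum(1 for r in valid if r["predicted"] == r["truth"])   # mu_i
    return 100.0 * correct / (len(asked) - len(errored))

def overall_accuracy(records, num_questions=12):
    """xi = (sum_i beta_i) / ((d - lambda) * q) * 100, over all documents."""
    by_doc = defaultdict(list)
    for r in records:
        by_doc[r["doc_id"]].append(r)
    d = len(by_doc)                                                   # d
    skipped = {doc for doc, rs in by_doc.items()
               if any(r["validation_error"] for r in rs)}             # lambda
    beta_sum = sum(1 for doc, rs in by_doc.items() if doc not in skipped
                   for r in rs if r["predicted"] == r["truth"])       # sum of beta_i
    return 100.0 * beta_sum / ((d - len(skipped)) * num_questions)

def random_baseline(alternatives_per_question):
    """xi_rand = mean over questions of (1 / alpha_i), as a percentage."""
    q = len(alternatives_per_question)
    return 100.0 * sum(1.0 / alpha for alpha in alternatives_per_question) / q
```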
These metrics were used to compare the performance of different LLMs:
- Four open-source models: Nous-Hermes-2-SOLAR-10.7B, Nous-Hermes-Llama2-13b, Mixtral-8x7B-v0.1-Instruct, and Smaug-34B-v0.1
- Two commercial models: ChatGPT-3.5-turbo-0125 and ChatGPT-4-turbo
The results were presented in two tables:
1. Including all data
2. Excluding datasets skipped by at least one chatbot
This comprehensive set of metrics allowed for a detailed comparison of model performance, both overall and for specific aspects of privacy policy analysis.
The document describes three tables explicitly, but does not describe any figures. Here's a description of the tables:
Table 1: Comparison of ToS-Busters dataset with other datasets in the literature
This table compares the ToS-Busters dataset with existing datasets:
- Columns: Name, Documents, Task, Mean Length (Words)
- Rows:
1. CLAUDETTE
2. LegalBench (privacy_policy_entailment task)
3. LegalBench (privacy_policy_qa task)
4. LegalBench (unfair_tos task)
5. ToS-Busters (the authors' dataset)
- Shows that ToS-Busters has more documents than CLAUDETTE and longer mean document length than LegalBench tasks.
Table 2: Total and per question accuracy for the tested chatbots
This table shows the performance of different chatbots:
- Columns: Chatbot, Random, NH-S, NH-L2, Mixtral, Smaug, GPT-3.5, GPT-4
- Rows:
- Params (number of parameters for each model)
- Skip (number of skipped documents due to errors)
- ξ (overall accuracy)
- ρ1 to ρ12 (accuracy for each of the 12 questions)
- Demonstrates the performance of each chatbot compared to random guessing, both overall and for each specific question. Question 10 appears to have been the toughest for the LLMs.
Data from Table 2, normalized to the random baseline (shown as the improvement over random performance), indicates that GPT-4 is 122% better than random averaged over all questions, with its worst performance on ρ10, and that it is significantly better than all the other LLMs tested.
Chatbot | Random | NH-S | NH-L2 | Mixtral | Smaug | GPT-3.5 | GPT-4 |
Average ξ over ρ | 0% | 55% | 35% | 84% | 61% | 58% | 122% |
ρ1 | 0% | 133% | 94% | 164% | 152% | 104% | 156% |
ρ2 | 0% | -16% | 15% | 105% | -5% | 41% | 115% |
ρ3 | 0% | 82% | 56% | 84% | 71% | 60% | 108% |
ρ4 | 0% | -14% | -11% | 49% | 2% | 91% | 114% |
ρ5 | 0% | 98% | 68% | 98% | 48% | 66% | 100% |
ρ6 | 0% | -7% | -38% | 109% | -36% | -16% | 156% |
ρ7 | 0% | 41% | 3% | 22% | 8% | -3% | 100% |
ρ8 | 0% | 28% | 10% | -10% | 192% | 93% | 221% |
ρ9 | 0% | 57% | 10% | 112% | 108% | 16% | 93% |
ρ10 | 0% | 1% | 61% | -5% | -5% | 44% | 25% |
ρ11 | 0% | 155% | 49% | 197% | 94% | 64% | 150% |
ρ12 | 0% | 83% | 82% | 62% | 85% | 92% | 114% |
Table 3: Total and per question accuracy for the tested chatbots excluding datasets skipped by at least one chatbot
This table is similar to Table 2, but with adjusted data:
- Same columns and rows as Table 2
- Shows recalculated accuracies after removing documents that caused errors in any of the chatbots
- Allows for a more direct comparison between models by using a consistent dataset across all of them.
Data from Table 3, normalized to the random baseline, shows that GPT-4 is 124% better than random averaged over the questions; its worst performance is on ρ10, where it drops below GPT-3.5 and NH-L2, and it is significantly better than all the other LLMs tested with the exception of ρ9, ρ10, and ρ11.
Chatbot | Random | NH-S | NH-L2 | Mixtral | Smaug | GPT-3.5 | GPT-4 |
Average ξ over ρ | 0% | 57% | 35% | 83% | 60% | 59% | 124% |
ρ1 | 0% | 140% | 93% | 163% | 160% | 109% | 160% |
ρ2 | 0% | -13% | 20% | 107% | -4% | 49% | 113% |
ρ3 | 0% | 82% | 55% | 85% | 71% | 60% | 105% |
ρ4 | 0% | -15% | -11% | 45% | 4% | 92% | 114% |
ρ5 | 0% | 99% | 68% | 101% | 46% | 66% | 99% |
ρ6 | 0% | -6% | -42% | 109% | -42% | -16% | 162% |
ρ7 | 0% | 37% | 2% | 22% | 8% | 0% | 105% |
ρ8 | 0% | 27% | 8% | -10% | 188% | 90% | 225% |
ρ9 | 0% | 59% | 12% | 112% | 110% | 17% | 103% |
ρ10 | 0% | 8% | 60% | -7% | -4% | 49% | 32% |
ρ11 | 0% | 156% | 44% | 198% | 95% | 59% | 156% |
ρ12 | 0% | 83% | 82% | 60% | 85% | 92% | 112% |
While not explicitly labeled as tables in the text, there are two additional data presentations that could be considered tables:
1. A list of the 12 questions used in the dataset (numbered 1-12)
The 12 questions used in the study to evaluate privacy policies are:
- Does the policy allow personally-targeted or behavioral marketing?
- Does the policy outline the service's general security practices?
- Does the service collect personal data from third parties?
- Is the policy's history made available?
- Does the service allow you to permanently delete your personal data?
- Does the policy require users to be notified in case of a data breach?
- Does the service allow third-party access to private personal data?
- Is it clear why the service collects the personal data that it does?
- Does the service allow the user to control whether personal data is collected or used for non-critical purposes?
- When does the policy allow law enforcement access to personal data?
- Does the policy list the personal data it collects?
- Will affected users be notified when the policy is meaningfully changed?
These questions are designed to cover various aspects of privacy policies, focusing on key issues such as data collection, third-party access, user control, and transparency in policy changes; a hedged sketch of how one of these questions might be posed to a chatbot appears after this list.
2. A comparison of the dataset characteristics (Name, Documents, Task, Mean Length) between ToS-Busters and other datasets in the literature.
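The paper refers to a template used to format each question for the chatbots, but the template itself is not reproduced here. The following is a hedged sketch of how one of the 12 questions and a (possibly summarized) policy could be combined into a prompt; the wording and the answer options are assumptions, not the authors' template.

```python
# Hedged sketch of a prompt for one policy/question pair. The wording and
# answer options below are illustrative assumptions, not the paper's template.
PROMPT_TEMPLATE = """You are reviewing the privacy policy of an online service.

Privacy policy:
{policy}

Question: {question}
Answer with exactly one of the following options: {options}
Answer:"""

def build_prompt(policy_text: str, question: str, options: list[str]) -> str:
    return PROMPT_TEMPLATE.format(
        policy=policy_text,
        question=question,
        options=", ".join(options),
    )

# Example usage with one of the 12 questions listed above:
prompt = build_prompt(
    policy_text="<policy text or its summary>",
    question="Does the policy list the personal data it collects?",
    options=["yes", "no"],
)
```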
Artifacts Used and Produced
The study mentions creating and using a dataset, but it doesn't explicitly discuss "artifacts" in the sense of tangible research outputs. However, we can infer some key artifacts from the description:
1. ToS-Busters Dataset:
- This is the main artifact produced by the study.
- It contains 12 questions applied to 220 privacy policies, resulting in 2,640 instruction-answer pairs.
- The ground truth answers were taken from PrivacySpy (https://privacyspy.org/).
2. Code for Experiments:
- The authors mention that their code for replicating the experiments is publicly available online.
3. Privacy Policy Documents:
- 220 privacy policies from popular online services were collected and used in the study.
4. Question Template:
- A template was created for formatting the questions for the LLMs.
5. Summarization Algorithm:
- An algorithm (Algorithm 1 in the paper) was developed for summarizing long privacy policies to fit within token limits.
6. Performance Results:
- The accuracy measurements and comparisons presented in Tables 2 and 3 could be considered analytical artifacts.
Regarding links, the paper mentions that their dataset and code for replicating experiments are publicly available online. However, the actual link is omitted in the text provided, likely due to double-blind submission requirements. The paper states:
"Our dataset and code for replicating our experiments is publicly available online 2
2 The link to the repository has been omitted due to the double-blind submission."
Authors
- Tenure Track Assistant Professor, Department of Electrical and Computer Engineering (Software Engineering & Computing Systems), Aarhus University
- https://orcid.org/0000-0001-7852-4582
- Associate Professor at Aarhus University