Data Cleaning

Why data cleaning matters?

Data cleaning helps ensure your results are based on valid, high-quality responses. ReDem automates this process to help you detect and exclude unreliable data. ReDem’s data cleaning feature automates and streamlines this process, providing a standardized and transparent approach grounded in ReDem’s comprehensive evaluation framework.

How Cleaning Works

Every respondent evaluated by ReDem goes through a series of quality checks. These checks produce scores and classification labels, which form the basis of the cleaning logic. ReDem offers two cleaning approaches:

1. ReDem Recommended Cleaning Settings

Our default settings apply best-practice thresholds to the following metrics:

ReDem Score (R-Score): Respondents with an R-Score below 60 are excluded.
Open-Ended Score (OES) & Response Categories: Respondents with an OES below 40 are excluded if they provide at least two open-ended responses. Respondents flagged in at least two open-ended responses for wrong language, bad language, AI generated Answer, nonsense or wrong topic are excluded.
Coherence Score (CHS): Respondents with a CHS below 30 are excluded.
Grid-Question Score (GQS): Respondents with a GQS below 20 and at least two valid grid-question responses are excluded.
Time Score (TS): Respondents with a TS below 20 are excluded.
Behavioral Analytics Score (BAS): Respondents with a BAS below 20 and at least two valid BAS data points are excluded.

2. Custom Cleaning Settings

You can define your own thresholds for each quality metric. This enables fine-tuned control over what qualifies as low-quality data based on your specific needs. Customizable elements include:

ReDem Score (R-Score): Threshold
Open-Ended Score (OES): Threshold + min. number of open-ended responses + category-based exclusion logic
- Open-Ended Response Categories:
  - Generic Answer:
  - Bad Language:
  - No Information:
  - Duplicate Respondent:
  - Duplicate Answer:
  - Nonsense:
  - Wrong Language:
  - Wrong Topic:
  - AI Generated Answer:
Time Score (TS): Threshold + min. number of time data points
Grid-Question Score (GQS): Threshold + min. valid grid questions
Coherence Score (CHS): Threshold + min. number of coherence data points
Behavioral Analytics Score (BAS): Threshold + min. valid BAS data points + category-based exclusion logic
- BAS Categories:
  - Unnatural Typing:
  - Copy and Paste:

Here’s an example of a cleaning settings object that can be used to customize the cleaning process via the API:

  "redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints":2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints":2},
      "NO_INFORMATION": {"activate": true, "minDataPoints":3},
      // ... other categories ...
    }
  },
  "CHS": {"activate": true,"score": 50},
  "GQS": {"activate": true,"score": 40, "minDataPoints":2},
  "TS": {"activate": true,"score": 20},
  "BAS": {
    "activate": true,
    "score": 60, 
    "minDataPoints":2,
    "categories": {
      "UNNATURAL_TYPING": {"activate": true, "minDataPoints":2},
      "COPY_AND_PASTE": {"activate": true, "minDataPoints":2},
    }
  }

What Is the Outcome of the Cleaning Process?

Each respondent is classified as either:

Included
Excluded

Reasons for exclusion are clearly documented and based on whether any of the active criteria are triggered. Only one exclusion condition needs to be met for a respondent to be removed.

ReDem Score below threshold
Open-Ended Score below threshold
Open-Ended categories exceed allowed count
Time Score below threshold
Grid-Question Score below threshold
Coherence Score below threshold
Behavioral Analytics Score below threshold
BAS categories (e.g. Copy-Paste, Unnatural Typing) exceed allowed count

This transparent system ensures your data cleaning is auditable, adjustable, and aligned with your quality requirements.

Example Case: Cleaning in Practice

Here’s a sample configuration and its impact on a respondent:

  "redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints":2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints":2},
      "NO_INFORMATION": {"activate": true, "minDataPoints":2},
      "BAD_LANGUAGE": {"activate": false, "minDataPoints":2},
      "NONSENSE": {"activate": false, "minDataPoints":2},
      "DUPLICATE_ANSWER": {"activate": false, "minDataPoints":2},
      "DUPLICATE_RESPONDENT": {"activate": false, "minDataPoints":2},
      "WRONG_TOPIC": {"activate": false, "minDataPoints":2},
      "WRONG_LANGUAGE": {"activate": false, "minDataPoints":2},
      "AI_GENERATED_ANSWER": {"activate": true, "minDataPoints":1}
    }
  }

This configuration will exclude a respondent who:

Scores < 60 on the ReDem Score
Has 2+ generic answers
Has 2+ “No Information” responses
Or even just 1 flagged AI-generated response (due to its lower tolerance threshold)

Quality Checks

Core Features

Miscellaneous

Why data cleaning matters?

How Cleaning Works

1. ReDem Recommended Cleaning Settings

2. Custom Cleaning Settings

What Is the Outcome of the Cleaning Process?

Example Case: Cleaning in Practice

Quality Checks

Core Features

Miscellaneous

​Why data cleaning matters?

​How Cleaning Works

​1. ReDem Recommended Cleaning Settings

​2. Custom Cleaning Settings

​What Is the Outcome of the Cleaning Process?

​Example Case: Cleaning in Practice

Why data cleaning matters?

How Cleaning Works

1. ReDem Recommended Cleaning Settings

2. Custom Cleaning Settings

What Is the Outcome of the Cleaning Process?

Example Case: Cleaning in Practice