Why data cleaning matters?

Data cleaning helps ensure your results are based on valid, high-quality responses. ReDem automates this process to help you detect and exclude unreliable data.

ReDem’s data cleaning feature automates and streamlines this process, providing a standardized and transparent approach grounded in ReDem’s comprehensive evaluation framework.

How Cleaning Works

Every respondent evaluated by ReDem goes through a series of quality checks. These checks produce scores and classification labels, which form the basis of the cleaning logic.

ReDem offers two cleaning approaches:

Our default settings apply best-practice thresholds to the following metrics:

  • ReDem Score (R-Score): Respondents with an R-Score below 60 are excluded.

  • Open-Ended Score (OES) & Response Categories: Respondents with an OES below 40 are excluded if they provide at least two open-ended responses.

    Respondents flagged in at least two open-ended responses for wrong language, bad language, AI generated Answer, nonsense or wrong topic are excluded.

  • Coherence Score (CHS): Respondents with a CHS below 30 are excluded.

  • Grid-Question Score (GQS): Respondents with a GQS below 20 and at least two valid grid-question responses are excluded.

  • Time Score (TS): Respondents with a TS below 20 are excluded.

  • Behavioral Analytics Score (BAS): Respondents with a BAS below 20 and at least two valid BAS data points are excluded.

2. Custom Cleaning Settings

You can define your own thresholds for each quality metric. This enables fine-tuned control over what qualifies as low-quality data based on your specific needs.

Customizable elements include:

  • ReDem Score (R-Score): Threshold
  • Open-Ended Score (OES): Threshold + min. number of open-ended responses + category-based exclusion logic
    • Open-Ended Response Categories:
      • Generic Answer:
      • Bad Language:
      • No Information:
      • Duplicate Respondent:
      • Duplicate Answer:
      • Nonsense:
      • Wrong Language:
      • Wrong Topic:
      • AI Generated Answer:
  • Time Score (TS): Threshold + min. number of time data points
  • Grid-Question Score (GQS): Threshold + min. valid grid questions
  • Coherence Score (CHS): Threshold + min. number of coherence data points
  • Behavioral Analytics Score (BAS): Threshold + min. valid BAS data points + category-based exclusion logic
    • BAS Categories:
      • Unnatural Typing:
      • Copy and Paste:

Here’s an example of a cleaning settings object that can be used to customize the cleaning process via the API:

"redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints":2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints":2},
      "NO_INFORMATION": {"activate": true, "minDataPoints":3},
      // ... other categories ...
    }
  },
  "CHS": {"activate": true,"score": 50},
  "GQS": {"activate": true,"score": 40, "minDataPoints":2},
  "TS": {"activate": true,"score": 20},
  "BAS": {
    "activate": true,
    "score": 60, 
    "minDataPoints":2,
    "categories": {
      "UNNATURAL_TYPING": {"activate": true, "minDataPoints":2},
      "COPY_AND_PASTE": {"activate": true, "minDataPoints":2},
    }
  }

What Is the Outcome of the Cleaning Process?

Each respondent is classified as either:

  • Included
  • Excluded

Reasons for exclusion are clearly documented and based on whether any of the active criteria are triggered. Only one exclusion condition needs to be met for a respondent to be removed.

  • ReDem Score below threshold
  • Open-Ended Score below threshold
  • Open-Ended categories exceed allowed count
  • Time Score below threshold
  • Grid-Question Score below threshold
  • Coherence Score below threshold
  • Behavioral Analytics Score below threshold
  • BAS categories (e.g. Copy-Paste, Unnatural Typing) exceed allowed count

This transparent system ensures your data cleaning is auditable, adjustable, and aligned with your quality requirements.

Example Case: Cleaning in Practice

Here’s a sample configuration and its impact on a respondent:

"redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints":2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints":2},
      "NO_INFORMATION": {"activate": true, "minDataPoints":2},
      "BAD_LANGUAGE": {"activate": false, "minDataPoints":2},
      "NONSENSE": {"activate": false, "minDataPoints":2},
      "DUPLICATE_ANSWER": {"activate": false, "minDataPoints":2},
      "DUPLICATE_RESPONDENT": {"activate": false, "minDataPoints":2},
      "WRONG_TOPIC": {"activate": false, "minDataPoints":2},
      "WRONG_LANGUAGE": {"activate": false, "minDataPoints":2},
      "AI_GENERATED_ANSWER": {"activate": true, "minDataPoints":1}
    }
  }

This configuration will exclude a respondent who:

  • Scores < 60 on the ReDem Score
  • Has 2+ generic answers
  • Has 2+ “No Information” responses
  • Or even just 1 flagged AI-generated response (due to its lower tolerance threshold)