Why Data Cleaning Matters

Data cleaning helps ensure your results are based on valid, high-quality responses. ReDem’s data cleaning feature automates and streamlines this process, helping you detect and exclude unreliable data through a standardized, transparent approach grounded in ReDem’s comprehensive evaluation framework.

How Cleaning Works

Every respondent evaluated by ReDem undergoes a series of quality checks. These checks generate data points, classification labels, and scores, which form the basis of the cleaning logic. When cleaning, you define what is acceptable or unacceptable by setting thresholds for these elements.
OR Condition:
  • All cleaning options operate as OR conditions.
  • You must select at least one score as a cleaning condition. This is usually the ReDem Score, since it is a comprehensive metric covering all selected quality checks.
  • You may also add other scores in an OR condition if needed.
Example:
If you set an R-Score threshold of 60, all respondents with an R-Score below 60 are flagged. If you also apply an OR condition for Time Score < 20, then even respondents with an acceptable R-Score (e.g., 70) are flagged if their Time Score is below 20 (speeding).
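The OR logic in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not ReDem's implementation; the function name `is_flagged` and its parameters are hypothetical.

```python
def is_flagged(r_score: float, time_score: float,
               r_threshold: float = 60, time_threshold: float = 20) -> bool:
    """Return True if ANY condition is met (OR logic)."""
    return r_score < r_threshold or time_score < time_threshold


print(is_flagged(r_score=70, time_score=15))  # True: Time Score is below 20
print(is_flagged(r_score=70, time_score=35))  # False: no condition is met
print(is_flagged(r_score=55, time_score=35))  # True: R-Score is below 60
```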
Data Points:
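Data points indicate the number of measurements used to calculate a score. A score condition with a minimum number of data points applies only to respondents who have at least that many data points for the score.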
Example:
If you set the Open-Ended Score threshold to 40 with two data points, then any interview with at least two open-ended responses and an overall Open-Ended Score below 40 is excluded.
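The interaction between a threshold and a minimum number of data points can be sketched as follows. The function `score_condition_met` is a hypothetical helper written for illustration and makes no claim about how ReDem evaluates the rule internally.

```python
def score_condition_met(score: float, data_points: int,
                        threshold: float, min_data_points: int) -> bool:
    """A score condition triggers only if enough data points exist
    AND the score is below the threshold."""
    return data_points >= min_data_points and score < threshold


# Open-Ended Score threshold of 40 with two data points:
print(score_condition_met(score=35, data_points=2, threshold=40, min_data_points=2))  # True
print(score_condition_met(score=35, data_points=1, threshold=40, min_data_points=2))  # False: too few responses
print(score_condition_met(score=45, data_points=3, threshold=40, min_data_points=2))  # False: score is not below 40
```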
Default Settings:
To simplify the process, ReDem provides recommended default settings that work well for many projects. You can, however, adjust them to match the specific needs of your study. The next two sections first describe the default settings and then explain how to select your own.
Our default settings apply best-practice thresholds to the following metrics:
  • ReDem Score (R-Score): Respondents with an R-Score below 60 are excluded.
  • Open-Ended Score (OES) & Response Categories: Respondents with an OES below 40 are excluded if they provided at least two open-ended responses. Respondents flagged in at least two open-ended responses for wrong language, bad language, AI-generated answer, nonsense, or wrong topic are also excluded.
  • Coherence Score (CHS): Respondents with a CHS below 30 are excluded.
  • Grid-Question Score (GQS): Respondents with a GQS below 20 and at least two valid grid-question responses are excluded.
  • Time Score (TS): Respondents with a TS below 20 are excluded.
  • Behavioral Analytics Score (BAS): Respondents with a BAS below 20 and at least two valid BAS data points are excluded.
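For orientation, the defaults listed above could be written as a settings object like the one below. This is a sketch under assumptions: it reuses the field names from the API settings examples in this guide, it assumes the two-response rule applies to each listed category separately, and it omits categories the defaults do not mention. The variable name `DEFAULT_CLEANING_SETTINGS` is hypothetical, and the values ReDem actually applies may be represented differently.

```python
import json

# Assumed mapping of the documented default thresholds onto the
# field names used in the API settings examples of this guide.
DEFAULT_CLEANING_SETTINGS = {
    "redemScore": 60,
    "OES": {
        "activate": True,
        "score": 40,
        "minDataPoints": 2,
        # Assumption: each listed category triggers on its own at 2 flagged responses.
        "categories": {
            name: {"activate": True, "minDataPoints": 2}
            for name in (
                "WRONG_LANGUAGE",
                "BAD_LANGUAGE",
                "AI_GENERATED_ANSWER",
                "NONSENSE",
                "WRONG_TOPIC",
            )
        },
    },
    "CHS": {"activate": True, "score": 30},
    "GQS": {"activate": True, "score": 20, "minDataPoints": 2},
    "TS": {"activate": True, "score": 20},
    "BAS": {"activate": True, "score": 20, "minDataPoints": 2},
}

# Serialize to JSON, e.g. to compare against a custom configuration.
print(json.dumps(DEFAULT_CLEANING_SETTINGS, indent=2))
```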

2. Custom Cleaning Settings

You can define your own thresholds for each quality metric. This enables fine-tuned control over what qualifies as low-quality data based on your specific needs. Customizable elements include:
  • ReDem Score (R-Score): Threshold
  • Open-Ended Score (OES): Threshold + min. number of open-ended responses + category-based exclusion logic
    • Open-Ended Response Categories:
      • Generic Answer
      • Bad Language
      • No Information
      • Duplicate Respondent
      • Duplicate Answer
      • Nonsense
      • Wrong Language
      • Wrong Topic
      • AI Generated Answer
  • Time Score (TS): Threshold + min. number of time data points
  • Grid-Question Score (GQS): Threshold + min. valid grid questions
  • Coherence Score (CHS): Threshold + min. number of coherence data points
  • Behavioral Analytics Score (BAS): Threshold + min. valid BAS data points + category-based exclusion logic
    • BAS Categories:
      • Unnatural Typing
      • Copy and Paste
Here’s an example of a cleaning settings object that can be used to customize the cleaning process via the API:
{
  "redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints": 2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints": 2},
      "NO_INFORMATION": {"activate": true, "minDataPoints": 3},
      // ... other categories ...
    }
  },
  "CHS": {"activate": true, "score": 50},
  "GQS": {"activate": true, "score": 40, "minDataPoints": 2},
  "TS": {"activate": true, "score": 20},
  "BAS": {
    "activate": true,
    "score": 60,
    "minDataPoints": 2,
    "categories": {
      "UNNATURAL_TYPING": {"activate": true, "minDataPoints": 2},
      "COPY_AND_PASTE": {"activate": true, "minDataPoints": 2}
    }
  }
}

What Is the Outcome of the Cleaning Process?

Each respondent is classified as either:
  • Included
  • Excluded
Reasons for exclusion are documented for each respondent and reflect which of the active criteria were triggered. Only one exclusion condition needs to be met for a respondent to be removed. Possible exclusion reasons are:
  • ReDem Score below threshold
  • Open-Ended Score below threshold
  • Open-Ended categories reach the configured count
  • Time Score below threshold
  • Grid-Question Score below threshold
  • Coherence Score below threshold
  • Behavioral Analytics Score below threshold
  • BAS categories (e.g. Copy and Paste, Unnatural Typing) reach the configured count
This transparent system ensures your data cleaning is auditable, adjustable, and aligned with your quality requirements.
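To illustrate how exclusion reasons might be collected for a single respondent, here is a minimal Python sketch covering the score-based conditions only (category counts are left out for brevity). The function `exclusion_reasons`, the shape of the `respondent` dictionary, and its `dataPoints` key are all hypothetical; only the settings field names are taken from the examples in this guide.

```python
SCORE_LABELS = {
    "OES": "Open-Ended Score",
    "TS": "Time Score",
    "GQS": "Grid-Question Score",
    "CHS": "Coherence Score",
    "BAS": "Behavioral Analytics Score",
}


def exclusion_reasons(respondent: dict, settings: dict) -> list:
    """Collect every triggered score condition; any single one excludes."""
    reasons = []
    if respondent["redemScore"] < settings["redemScore"]:
        reasons.append("ReDem Score below threshold")
    for key, label in SCORE_LABELS.items():
        rule = settings.get(key)
        result = respondent.get(key)
        # Skip scores that are not configured, not activated, or not measured.
        if not rule or not rule.get("activate") or result is None:
            continue
        enough = result.get("dataPoints", 0) >= rule.get("minDataPoints", 1)
        if enough and result["score"] < rule["score"]:
            reasons.append(f"{label} below threshold")
    return reasons


settings = {
    "redemScore": 60,
    "TS": {"activate": True, "score": 20},
    "CHS": {"activate": True, "score": 50},
}
respondent = {
    "redemScore": 72,
    "TS": {"score": 15, "dataPoints": 4},
    "CHS": {"score": 80, "dataPoints": 3},
}

reasons = exclusion_reasons(respondent, settings)
print("Excluded" if reasons else "Included", reasons)
# Excluded ['Time Score below threshold']
```

Returning the full list of reasons, rather than stopping at the first match, keeps the result auditable when several conditions trigger at once.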

Example Case: Cleaning in Practice

Here’s a sample configuration and its impact on a respondent:
{
  "redemScore": 60,
  "OES": {
    "activate": true,
    "score": 60,
    "minDataPoints": 2,
    "categories": {
      "GENERIC_ANSWER": {"activate": true, "minDataPoints": 2},
      "NO_INFORMATION": {"activate": true, "minDataPoints": 2},
      "BAD_LANGUAGE": {"activate": false, "minDataPoints": 2},
      "NONSENSE": {"activate": false, "minDataPoints": 2},
      "DUPLICATE_ANSWER": {"activate": false, "minDataPoints": 2},
      "DUPLICATE_RESPONDENT": {"activate": false, "minDataPoints": 2},
      "WRONG_TOPIC": {"activate": false, "minDataPoints": 2},
      "WRONG_LANGUAGE": {"activate": false, "minDataPoints": 2},
      "AI_GENERATED_ANSWER": {"activate": true, "minDataPoints": 1}
    }
  }
}
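This configuration excludes a respondent who meets any of the following conditions:
  • Scores below 60 on the ReDem Score
  • Scores below 60 on the Open-Ended Score with at least two open-ended responses
  • Has 2+ generic answers
  • Has 2+ “No Information” responses
  • Has even a single response flagged as AI-generated (this category is set to a stricter minimum of 1)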
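The category conditions of this configuration can be checked with a short Python sketch. It assumes that each open-ended response carries a list of category labels; that input shape and the helper `triggered_categories` are hypothetical and serve only to illustrate the counting logic.

```python
from collections import Counter

# Active category rules from the example configuration above,
# expressed as: category name -> minimum number of flagged responses.
category_rules = {
    "GENERIC_ANSWER": 2,
    "NO_INFORMATION": 2,
    "AI_GENERATED_ANSWER": 1,
}


def triggered_categories(response_labels, rules):
    """Return the categories whose flagged-response count reaches the minimum."""
    counts = Counter(label for labels in response_labels for label in labels)
    return sorted(name for name, minimum in rules.items() if counts[name] >= minimum)


# Three open-ended responses: one generic, one clean, one flagged as AI-generated.
respondent = [["GENERIC_ANSWER"], [], ["AI_GENERATED_ANSWER"]]
print(triggered_categories(respondent, category_rules))
# ['AI_GENERATED_ANSWER']
```

A single generic answer stays below its minimum of two, while the single AI-generated flag is already enough to exclude the respondent.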