Data Cleaning
Discover how ReDem automates and simplifies data cleaning.
Why data cleaning matters?
Data cleaning helps ensure your results are based on valid, high-quality responses. ReDem automates this process to help you detect and exclude unreliable data.
ReDem’s data cleaning feature automates and streamlines this process, providing a standardized and transparent approach grounded in ReDem’s comprehensive evaluation framework.
How Cleaning Works
Every respondent evaluated by ReDem goes through a series of quality checks. These checks produce scores and classification labels, which form the basis of the cleaning logic.
ReDem offers two cleaning approaches:
1. ReDem Recommended Cleaning Settings
Our default settings apply best-practice thresholds to the following metrics:
-
ReDem Score (R-Score): Respondents with an R-Score below 60 are excluded.
-
Open-Ended Score (OES) & Response Categories: Respondents with an OES below 40 are excluded if they provide at least two open-ended responses.
Respondents flagged in at least two open-ended responses for wrong language, bad language, AI generated Answer, nonsense or wrong topic are excluded.
-
Coherence Score (CHS): Respondents with a CHS below 30 are excluded.
-
Grid-Question Score (GQS): Respondents with a GQS below 20 and at least two valid grid-question responses are excluded.
-
Time Score (TS): Respondents with a TS below 20 are excluded.
-
Behavioral Analytics Score (BAS): Respondents with a BAS below 20 and at least two valid BAS data points are excluded.
2. Custom Cleaning Settings
You can define your own thresholds for each quality metric. This enables fine-tuned control over what qualifies as low-quality data based on your specific needs.
Customizable elements include:
- ReDem Score (R-Score): Threshold
- Open-Ended Score (OES): Threshold + min. number of open-ended responses + category-based exclusion logic
- Open-Ended Response Categories:
- Generic Answer:
- Bad Language:
- No Information:
- Duplicate Respondent:
- Duplicate Answer:
- Nonsense:
- Wrong Language:
- Wrong Topic:
- AI Generated Answer:
- Open-Ended Response Categories:
- Time Score (TS): Threshold + min. number of time data points
- Grid-Question Score (GQS): Threshold + min. valid grid questions
- Coherence Score (CHS): Threshold + min. number of coherence data points
- Behavioral Analytics Score (BAS): Threshold + min. valid BAS data points + category-based exclusion logic
- BAS Categories:
- Unnatural Typing:
- Copy and Paste:
- BAS Categories:
Here’s an example of a cleaning settings object that can be used to customize the cleaning process via the API:
What Is the Outcome of the Cleaning Process?
Each respondent is classified as either:
- Included
- Excluded
Reasons for exclusion are clearly documented and based on whether any of the active criteria are triggered. Only one exclusion condition needs to be met for a respondent to be removed.
- ReDem Score below threshold
- Open-Ended Score below threshold
- Open-Ended categories exceed allowed count
- Time Score below threshold
- Grid-Question Score below threshold
- Coherence Score below threshold
- Behavioral Analytics Score below threshold
- BAS categories (e.g. Copy-Paste, Unnatural Typing) exceed allowed count
This transparent system ensures your data cleaning is auditable, adjustable, and aligned with your quality requirements.
Example Case: Cleaning in Practice
Here’s a sample configuration and its impact on a respondent:
This configuration will exclude a respondent who:
- Scores < 60 on the ReDem Score
- Has 2+ generic answers
- Has 2+ “No Information” responses
- Or even just 1 flagged AI-generated response (due to its lower tolerance threshold)