New research has uncovered serious privacy issues in DataComp CommonPool, a 12.8-billion-image dataset. While it has not been confirmed as a training source for models such as Stable Diffusion or Midjourney, the findings highlight precisely the risks regulators are targeting under the EU AI Act and the GDPR.

The findings
• 142,000+ résumé/CV images with personal details
• 100M+ unblurred faces despite “privacy” claims
• Thousands of ID documents (passports, credit cards, certificates)
• Children’s personal data included
Why this matters
• Providers of foundation models (Art. 53, AI Act) must publish a copyright policy and a summary of their training data. Many can’t meet this yet.
• Deployers of high-risk AI systems (Art. 26/27) must use models in accordance with provider instructions, ensure human oversight, keep logs, and in some cases run a Fundamental Rights Impact Assessment (FRIA).
• GDPR fines: up to €20M or 4% of worldwide annual turnover, whichever is higher.
• AI Act fines: up to €35M or 7% of worldwide annual turnover, whichever is higher.
What you should do now
• Audit your AI supply chain - Ask providers for their training data summary and copyright policy.
• Classify your use cases - Check whether any qualify as “high-risk” under Annex III.
• Run DPIAs/FRIAs - Where the GDPR or AI Act requires them.
• Update contracts - Flow down obligations and remediation terms.
• Adopt compliance-by-design - Turn governance into a competitive advantage.
Takeaway: These findings don’t make every foundation model “illegal”, but they do show why regulators are demanding transparency and data governance. Whether you’re a provider or a deployer, your obligations are already in force.
If you need help navigating the AI Act and the messy world of training-data compliance, let’s talk.