With the rise of generative AI tools, using web scraping techniques to collect personal data from websites and use them for training purposes is becoming increasingly common. But are such practices always GDPR compliant in the EU?
Personal data do not lose their protection under GDPR simply by being published online. When collecting publicly available data, it is therefore essential to comply with GDPR principles, many of which are currently being challenged by web scraping.
Recently, the European Data Protection Supervisor, with its Guidelines on Generative AI: embracing opportunities, protecting people, expressed concerns around scraping, especially with regard to:
The Dutch data protection authority, with its Guidelines on scraping by private organizations and individuals (in Dutch), also highlighted a number of data protection risks connected with web scraping. However, the Dutch guidelines only addressed the lawfulness of scraping and the elements that may affect the legitimate interest assessment required for almost any kind of scraping activity.
Most significantly, the Dutch guidelines seem to conclude that only targeted scraping (i.e., scraping that is very limited in terms of sources and purposes) is compatible with GDPR, although they leave it up to controllers to assess the lawfulness and viability of such processing on a case-by-case basis. Despite their limited scope, these guidelines mark one of the first attempts by EU data protection authorities to offer practical guidance on ensuring that scraping activities comply with EU legislation.
In the same vein, the EDPB's recent Report of the work undertaken by the ChatGPT Taskforce, albeit focused on OpenAI's data processing, gave important indications as to certain technical measures that may reduce the impact of web scraping on individuals. These include “defining precise collection criteria and ensuring that certain data categories are not collected or that certain sources (such as public social media profiles) are excluded from data collection” and adopting measures to “delete or anonymise personal data […] before the training stage.” The report also seems to recognize that the characteristics of scraping justify providing a privacy notice only via public means (art. 14(5)(b) GDPR), provided that the controller duly documents the reasons for doing so.
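As a rough illustration of the technical measures quoted above, the following Python sketch excludes certain sources from collection and redacts obvious identifiers before text is passed on for training. It is only a sketch under stated assumptions: the host name in `EXCLUDED_HOSTS`, the two regular expressions and all function names are hypothetical and do not come from the EDPB report. Pattern matching of this kind catches only obvious identifiers and would not by itself amount to anonymisation.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion list: sources the controller has decided not to scrape
# (e.g. social media sites). The host name is a placeholder.
EXCLUDED_HOSTS = {"social.example"}

# Deliberately simple patterns for two obvious identifier types.
# Assumption: real pipelines would need far broader detection than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def is_allowed_source(url: str) -> bool:
    """Return False if the URL's host is an excluded host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS)


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def prepare_for_training(pages):
    """Keep only pages from allowed sources and redact their text.

    `pages` is an iterable of (url, text) pairs that were already fetched.
    """
    return [redact(text) for url, text in pages if is_allowed_source(url)]


if __name__ == "__main__":
    sample = [
        ("https://www.social.example/profile/1", "Jane's public profile"),
        ("https://news.example/article", "Contact jane.doe@mail.example for details."),
    ]
    print(prepare_for_training(sample))
```

In practice the exclusion check would ideally run before a page is fetched at all, so that data from excluded sources is never collected in the first place.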
Similar recommendations have been put forward by the French data protection authority (CNIL) in its latest Factsheet on web scraping and the legitimate interest (in French). In particular, the CNIL reiterated the importance of defining collection criteria (excluding sensitive data and, possibly, any personal data) and of promptly deleting or anonymizing unnecessary personal data right after their collection. Among the collection criteria, the CNIL, in line with the previous interpretations, suggested to:
It also introduced some interesting suggestions, to foster transparency of scraping practices as well as data subjects’ control over their data, including by:
Even though a holistic GDPR approach to scraping is yet to be defined, more guidance from authorities on GDPR and scraping is expected to follow in the near future, especially after the conclusion of investigations into OpenAI's data practices, such as the Garante’s ‘saga’ on OpenAI. In the meantime, in the absence of a clear position, any scraping activity will inherently trigger data protection risks, leaving the controller with the onerous task of assessing its compliance with GDPR (also via a DPIA) and of monitoring any new developments in this regard.
Stay tuned for more insights on AI and scraping.