Web Scraping and Personal Data: Legal and Ethical Boundaries in AI

April 11, 2025

By Vikrant Rana, Anuradha Gandhi, Rachita Thakur

Introduction

During the 267th Session of Parliament, Shri S. Niranjan Reddy, Member of Parliament from Andhra Pradesh, raised concerns over web scraping of publicly available user data by social media companies for training Artificial Intelligence (AI) models. He questioned the Ministry of Electronics and Information Technology (MeitY) regarding measures to regulate web scraping, ensure transparency and establish accountability. In response, Shri Jitin Prasada, Minister of State in the Ministry of Electronics and Information Technology, emphasized that Indian laws, including the Information Technology Act, 2000 (hereinafter referred to as the IT Act), its associated Rules and the Digital Personal Data Protection Act, 2023 (hereinafter referred to as the DPDP Act), address these concerns.[1]

What is Web Scraping?

Web scraping, also referred to as data scraping, is an automated technique used to extract information from websites. It can collect a variety of content, including text, images, videos and metadata, which is then analyzed or stored for various uses.[2] It enables the collection of such data from platforms such as Reddit, YouTube, Instagram and LinkedIn.[3]

The results of AI training rest on three critical workflows[4]:

  1. Data extraction: collection of unstructured raw data from various sources.
  2. Filtering: removing irrelevant or low-quality content.
  3. Dataset curation: organizing the cleaned data into structured formats suitable for training.
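The three workflows above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of how any particular AI developer builds its datasets. The function names (extract, filter_fragments, curate), the four-word threshold and the JSON record layout are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text fragments, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def extract(html):
    """Step 1, data extraction: pull raw text fragments out of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


def filter_fragments(fragments, min_words=4):
    """Step 2, filtering: drop very short fragments and exact duplicates.

    The min_words threshold is an arbitrary, illustrative quality rule.
    """
    seen = set()
    kept = []
    for fragment in fragments:
        normalized = " ".join(fragment.split())
        if len(normalized.split()) < min_words:
            continue
        key = normalized.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(normalized)
    return kept


def curate(fragments, source):
    """Step 3, dataset curation: emit one JSON record per cleaned fragment."""
    return [json.dumps({"source": source, "text": text}) for text in fragments]
```

Calling curate(filter_fragments(extract(page_html)), "example-page") on a page's HTML yields a list of JSON strings, one per retained fragment. Note that nothing in this sketch distinguishes personal data from other text, which is precisely why the legal questions discussed in this article arise.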

What is it used for?

Web scraping is an essential tool for building foundational models for AI. It is also commonly used for:

  • Auction and bidding platforms
  • Travel and flight information
  • Event tracking
  • Real-time updates
  • Collection of job listings across various platforms
  • Creating lead generation lists
  • Competitor tracking
  • Sentiment analysis
  • Industry research
  • Price change monitoring[5]

Report of the work undertaken by the ChatGPT Taskforce (23 May 2024)[6]

Web scraping allows the automated collection and extraction of certain information from different publicly available sources on the Internet, which is then used to train ChatGPT. Such information can contain personal data, which can include aspects of the personal life of the data subject. Depending on the source, the scraped data can also contain special categories of personal data within the meaning of Article 9(1) GDPR.

Therefore, processing has to be based on:

  1. Existence of a legitimate interest.
  2. Necessity of processing, as the personal data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.
  3. Balancing of interests: the fundamental rights and freedoms of data subjects on the one hand and the controller’s legitimate interests on the other have to be evaluated and balanced carefully. The assessment should consider what the data subject can reasonably expect regarding the use of their data[7]. The mere fact that personal data is publicly accessible does not imply that “the data subject has manifestly made such data public”. There has to be an understanding on the part of the data subject and an affirmative action on the data subject’s part indicating so.
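As an illustration of the necessity criterion, a scraper can strip obvious personal identifiers from text before storing it. The sketch below is a deliberately crude assumption-laden example: the function name minimise, the placeholder tokens and both regular expressions are hypothetical, and pattern matching of this kind will miss many forms of personal data (names, addresses, images). It does not by itself make processing lawful.

```python
import re

# Illustrative patterns only; real personal data takes many more forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def minimise(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

For example, minimise("Contact jane.doe@example.com or +91 98765 43210 for details.") returns "Contact [EMAIL] or [PHONE] for details.".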

Privacy and other concerns in web scraping[8]

  1. Data posted on hacking forums may be used by malicious actors in targeted social engineering or phishing attacks.
  2. Data may be used to submit fraudulent loan or credit card applications, or to impersonate the individual by creating fake social media accounts.
  3. Data is also used for monitoring and profiling; for example, scraped data may be used to populate facial recognition databases.
  4. Marketing is another concern, as scraped data can be used to send unsolicited bulk marketing messages.
  5. Web scraping may also raise ethical concerns, especially if personal or sensitive personal data is scraped from individuals or groups without their knowledge or consent, leading to the data being used for purposes not intended by the data principal, such as profiling or influencing.

Legality of web scraping

Web scraping is not explicitly illegal, and there are no specific laws prohibiting it; many companies employ it legitimately to gain data-driven insights. However, the terms of use and privacy policies of various websites may not allow web scraping. While there may be fewer restrictions on scraping publicly available data, as opposed to private information, there is still a need to ensure that other laws applicable to such data are not breached.

In the case of Meta Platforms, Inc. v. Bright Data Ltd.[9], a U.S. federal court ruled that engaging in logged-off scraping of public data does not amount to breach of contract. Accessing and collecting public web data was held to be lawful because Bright Data was scraping publicly available data, which is visible to anyone without logging in. Bright Data was not scraping any data that was protected by passwords.

This opinion was built on the precedent of hiQ Labs, Inc. v. LinkedIn Corporation[10], in which hiQ Labs, a people analytics company, scraped publicly available data from LinkedIn's servers. This led LinkedIn to send a cease-and-desist letter alleging a violation of its terms of use and unauthorized access under the Computer Fraud and Abuse Act (hereinafter referred to as CFAA). The United States Court of Appeals for the Ninth Circuit confirmed that one cannot be criminally liable under the CFAA for scraping publicly available data. Further, there was no unauthorized access, and the respondent was not using any copyrighted or protected data. The data subjects posting their information were aware that they had made their profiles public; had they kept them private, access to those profiles would have required authorization.

By contrast, in another landmark matter on web scraping, Clearview AI, Inc., an American facial recognition company, used web crawlers to scan websites for images containing faces, including social media, professional sites, blogs and publicly accessible videos. The company targeted all publicly available data and collected individuals' biometric facial geometry. Through its internet scraping efforts, Clearview’s database quickly grew to a staggering three billion images.

This business practice eroded long-established rights of personal privacy and autonomy, supercharging governments’ power to surveil marginalized groups and threatening free expression.

The French data protection authority (CNIL) clarified that the ‘publicly accessible’ nature of data does not change its classification as personal data, nor does it grant a general authorization for its reuse or further processing, especially without the data subject’s knowledge. It further emphasized that biometric data are particularly sensitive because they are tied to our physical identity (who we are) and enable unique identification.

The Information Commissioner’s Office (ICO) ordered Clearview to stop obtaining and using the personal data of UK residents that is publicly available on the Internet and to delete any such data from its systems.[11]

Legal position of scraping data

Large data-processing AI systems are based on black-box models, which places a practical limit on the data principal's ability to intervene. Where rectification is not feasible due to technical complexity, data subjects are therefore advised to shift from rectification to erasure. Further, various data privacy laws have been implemented which provide mechanisms for consent, notice and opt-out for data principals.[12]
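An opt-out or erasure mechanism can be pictured as a filter applied to a dataset before it is used for training. The sketch below is a hypothetical illustration only: the Record type, the subject_id field and the apply_erasure function are assumptions for the example, and it presumes each record can be linked to an identifiable individual, which is often the hard part in practice. It also says nothing about removing data from a model that has already been trained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    subject_id: str  # hypothetical identifier linking a record to a person
    text: str


def apply_erasure(records, opted_out_ids):
    """Drop every record belonging to a person who opted out or requested erasure.

    Returns the retained records and the number of records removed.
    """
    opted_out = set(opted_out_ids)
    kept = [record for record in records if record.subject_id not in opted_out]
    removed = len(records) - len(kept)
    return kept, removed
```

Returning the removed count alongside the retained records gives the operator a simple figure to log when demonstrating that a request was acted upon.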

India

Under Section 43 of the IT Act, unauthorized access to computer systems is penalized, with compensation payable to affected parties. Additionally, the Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021, require social media companies to prevent the hosting, sharing or displaying of content that violates the law and to implement measures for data protection.

The recently enacted DPDP Act mandates that organizations seek user consent before processing personal data and maintain robust compliance frameworks. The Act establishes accountability through the Data Protection Board of India, which can investigate complaints, impose penalties of up to INR 250 crore for non-compliance and safeguard user rights.

If the material scraped is copyrighted material, the Copyright Act, 1957 will be applicable and it will be considered an infringement of copyright law leading to penal actions.

Other Jurisdictions

  1. The General Data Protection Regulation (GDPR) restricts what businesses can do with the contact data they wish to extract.[13] It regulates how companies collect and process personal data.
  2. The California Consumer Privacy Act (CCPA) also regulates the collection and processing of personal information of California residents by businesses.

Conclusion

Web scraping, although not explicitly illegal, is still governed by various data privacy laws, as it may involve the personal data of individuals or groups (data principals). Social media companies should refrain from such activities, as a large share of social media users are young people, whose data, if leaked, will not only harm their future but also attract penal consequences for the entities that scraped it without consent. Further, AI platforms now provide opt-out features to data principals, an important mechanism for protecting individual data, but concerns still arise where one individual uploads another person's data to such platforms.

Abhishekta Sharma, Junior Associate Advocate at S.S. Rana & Co., has assisted in the research of this article.

[1] https://sansad.in/getFile/annex/267/AU558_hhtT2g.pdf?source=pqars

[2] https://www.miquido.com/ai-glossary/what-is-ai-data scraping/#:~:text=AI%20web%20scraping%20enhances%20the,structures%20without%20needing%20manual%20reconfiguration.

[3] https://scrapfly.io/use-case/ai-training-web scraping#:~:text=Web%20scraping%20enables%20you%20to,computer%20vision%2C%20and%20other%20applications.

[4] https://oxylabs.io/blog/web-scraping-ai-training

[5] https://medium.com/@RenderAnalytics/high-value-web-scraping-use-cases-86334ba24e5c

[6] https://www.edpb.europa.eu/system/files/2024-05/edpb_20240523_report_chatgpt_taskforce_en.pdf

[7] Article 6(1)(f) General Data Protection Regulation and Recital 47 GDPR

[8] https://ico.org.uk/media/about-the-ico/documents/4026232/joint-statement-data-scraping-202308.pdf

[9] Case No. 3:23-cv-00077-EMC (N.D. Cal.)

[10] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf

[11] https://www.hicomply.com/blog/clearview-ai-and-the-ethics-of-data-scraping-for-facial-recognition

[12] https://www.edpb.europa.eu/system/files/2024-05/edpb_20240523_report_chatgpt_taskforce_en.pdf

[13] Article 14 of GDPR

For more information please contact us at : info@ssrana.com