Direct Website Access Granted to ChatGPT: Unveiling New Possibilities

Since around May 12, 2023, some users have been able to instruct ChatGPT to access specific websites live and make statements about their content. This was not possible before; earlier attempts to include a URL in a prompt always resulted in statements like the following:

I apologize, as a purely text-based AI model, I cannot visit websites or directly access web content. However, I can try to provide you with information based on my existing knowledge about the website aircargobook.com.

However, based on the name of the website, it suggests that it is likely related to air freight or cargo shipments […].

Detecting and reacting to ChatGPT access

First, it is important to know that every web client (a browser, or a reputable crawler or bot) identifies the program currently requesting the page to the accessed website through the User-Agent header. Most website operators are already familiar with this information from Google Analytics or whatever tracking software they use. Note, however, that this header is set by the accessing entity itself and can therefore be incorrect.

OpenAI promises to set this User-Agent string in a compliant manner and documents what it looks like. You can check the latest information at https://platform.openai.com/docs/plugins/bot. Currently it is set to “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot”.
Assuming that this information is entirely correct, we would then know which content and browser features are available to ChatGPT.
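If we take this documented User-Agent string at face value, detecting ChatGPT on the server side comes down to a substring check. A minimal sketch (the helper name and example values are our own, not from OpenAI):

```python
# Minimal sketch: classify a request as coming from the ChatGPT browsing
# feature based on its User-Agent header. The token "ChatGPT-User" comes
# from the User-Agent string documented by OpenAI.

def is_chatgpt_request(user_agent: str) -> bool:
    """Return True if the User-Agent identifies the ChatGPT bot."""
    return "ChatGPT-User" in (user_agent or "")

# Example with the currently documented string:
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
      "compatible; ChatGPT-User/1.0; +https://openai.com/bot")
print(is_chatgpt_request(ua))                                # True
print(is_chatgpt_request("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

Remember the caveat above: the header is self-reported, so this check only identifies clients that choose to identify themselves honestly.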

Just like other bots, OpenAI respects the robots.txt file. This file is designed to allow or block access to a website for entities like Google, Bing, or ChatGPT. It can be set for the entire website or specific subsections or directories.
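Because the documented User-Agent token is “ChatGPT-User”, a robots.txt file can address it like any other bot. A sketch (the blocked path is just a placeholder):

```text
# Block ChatGPT from a hypothetical sensitive directory,
# while leaving the rest of the site open to everyone.
User-agent: ChatGPT-User
Disallow: /private/

User-agent: *
Allow: /
```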

Furthermore, the same documentation page also tells us the IP range from which our website will be accessed. This information will be useful later on.

Initial tests show that ChatGPT generally behaves correctly

Based on our initial tests, ChatGPT appears to adhere to the behavior described in the robots.txt file, depending on how we structure the prompt. If we have a website without a robots.txt file or with a robots.txt file that allows access to the website, ChatGPT will access it directly and without any issues. In the access logs, we can see that before accessing the website, the ChatGPT browser first checks what is specified in the robots.txt file and then proceeds to access the website accordingly. If we repeat the process with a website that does not permit access, ChatGPT will inform us about it.
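The robots.txt check that ChatGPT appears to perform can be reproduced with Python's standard library. A sketch of that logic, using `urllib.robotparser` with illustrative rules and URLs:

```python
# Sketch of the check ChatGPT appears to perform: evaluate robots.txt
# rules before requesting a page. The rules and URLs are illustrative.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: ChatGPT-User
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Allowed: the rules only block /private/ for ChatGPT-User.
print(rp.can_fetch("ChatGPT-User", "https://example.com/index.html"))  # True
# Blocked: matches the Disallow rule.
print(rp.can_fetch("ChatGPT-User", "https://example.com/private/x"))   # False
```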

I’m sorry, but I am unable to directly access the website xxx because the site’s robots.txt file is preventing me from doing so. The robots.txt file is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website.

172.31.12.220 – – [18/May/2023:19:52:14 +0000] “GET /robots.txt HTTP/1.1” 200 45 “-” “Python/3.9 aiohttp/3.8.4” “23.98.142.178, 15.158.51.228”

This means that the website is not accessed by a browser-like ChatGPT client, but by a separate Python script (aiohttp), while still operating within the specified IP range. This way, if desired, we can determine the origin of the traffic.
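Since the User-Agent can be forged, the published IP range is the more reliable signal. A sketch of such a check with the standard `ipaddress` module; the CIDR below is a placeholder only, the real ranges must be taken from OpenAI's documentation:

```python
# Sketch: verify that a request claiming to be ChatGPT actually comes
# from the published IP range. The CIDR below is a PLACEHOLDER chosen to
# contain the IP from the access log above; consult OpenAI's bot
# documentation for the authoritative ranges.
import ipaddress

CHATGPT_RANGES = [ipaddress.ip_network("23.98.142.176/28")]  # placeholder

def ip_in_ranges(ip: str) -> bool:
    """Return True if the address falls inside any configured range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CHATGPT_RANGES)

print(ip_in_ranges("23.98.142.178"))  # True  (inside the placeholder range)
print(ip_in_ranges("203.0.113.5"))    # False (documentation address)
```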

Furthermore, in our brief research, none of the top 10 websites in Germany (according to ChatGPT) have explicitly treated access from ChatGPT differently than other bots. Some websites have more restrictions, while others have fewer. Some websites even address specific bots like the Google or Bing bot, but so far, we have not found any that explicitly target ChatGPT.

Website URL
Google www.google.de
YouTube www.youtube.de
Facebook www.facebook.de
Amazon www.amazon.de
eBay www.ebay.de
Spiegel Online www.spiegel.de
Deutsche Bahn www.bahn.de
Bild www.bild.de
Wikipedia de.wikipedia.org
DHL www.dhl.de

Why would I want to treat traffic from AI platforms differently?

Apart from controlling access through the robots.txt file, it is also possible to actively detect on the server side that a request comes from ChatGPT, and to use that information if desired.

Here are some potential use cases:

  1. Privacy and Security: Some website operators may not want AI models to crawl their websites, especially if they contain confidential or sensitive data. They could restrict or block certain content for AI access.
  2. Content Optimization: A website operator may find that AI models prefer or interpret certain types of content better. In such cases, the operator could decide to adapt the content for AI access to enhance the efficiency of AI in processing the website’s information.
  3. Prevention of Abuse: Concerns regarding the misuse of content by AI models, such as scraping or spamming, may lead some operators to restrict or prevent access for AI models.
  4. Resource Management: AI models may have high resource requirements, particularly when generating numerous page requests. Therefore, website operators might limit AI access to manage their server load.
  5. Compliance with Policies and Regulations: Depending on the legal jurisdiction and the type of information a website hosts, there may be legal or regulatory requirements specifying that AI models are subject to certain access rules or restrictions.
  6. User Experience: Occasionally, optimizing the content scanned by AI models can improve the relevance and accuracy of the answers generated by the AI.
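Several of these use cases (notably content optimization and resource management) can be combined into a single request-handling decision. A rough sketch; all thresholds, helper names, and response bodies are our own illustrative choices, not a recommendation:

```python
# Sketch combining two of the use cases above: serve AI bots a reduced
# view of the content, and cap their request volume. The bot token comes
# from OpenAI's documented User-Agent; everything else is illustrative.
from collections import defaultdict

BOT_TOKEN = "ChatGPT-User"
MAX_BOT_REQUESTS = 100           # arbitrary per-window budget
bot_hits = defaultdict(int)      # request counter keyed by client IP

def handle_request(ip: str, user_agent: str) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for the incoming request."""
    if BOT_TOKEN in user_agent:
        bot_hits[ip] += 1
        if bot_hits[ip] > MAX_BOT_REQUESTS:
            return 429, "Too Many Requests"      # resource management
        return 200, "reduced, text-only view"    # content optimization
    return 200, "full page"

status, body = handle_request("203.0.113.7", "... ChatGPT-User/1.0 ...")
print(status, body)  # 200 reduced, text-only view
```

In a real deployment the counter would need a time window and shared storage, but the decision structure stays the same.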

Of course, we will all need to gather more experience to determine what makes sense or not in these scenarios. In a few weeks or months, we will surely have a better understanding.
