Google has given publishers concerned about how AI systems use their content the ability to block that content from being scanned for AI training.
By adding – or omitting – the Google-Extended token in their robots.txt file, publishers can decide whether the “crawler” programs that automatically scan a website are allowed to use their content to train the company’s Bard and Vertex AI generative models.
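For example, a publisher that wanted to opt out entirely could add an entry along these lines to its robots.txt file (the exact scope of the block is up to the site owner):

    User-agent: Google-Extended
    Disallow: /

A narrower Disallow path would withhold only part of the site from training rather than all of it.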
While Google framed allowing crawler access as a way to “help these AI models become more accurate and capable over time,” it also gives publishers the ability to opt out of having their content used to train the company’s APIs, as well as future generations of those models.
“The rapid growth and development of generative AI tools is helping web publishers connect with their audiences more easily and creatively than ever before,” Danielle Romain, Google’s VP of trust, wrote in a company blog post. “We’re committed to developing AI responsibly, guided by our AI principles and in line with our consumer privacy commitment. However, we’ve also heard from web publishers that they want greater choice and control over how their content is used for emerging generative AI use cases.”
Romain added that Google is exploring additional AI-related controls for publishers.
The proliferation of generative AI has also raised concerns about intellectual property rights and copyright among those creating the content that systems are trained on, from visual artists to musicians to news publishers. Many major publishers, including The New York Times and Medium, have already taken steps to block the crawlers that scrape web content used to train systems like ChatGPT. However, blocking Google’s crawlers entirely would also keep their sites out of Google search results, hurting their traffic – Google-Extended lets them block the AI-training crawler while still allowing the crawlers that index pages for the company’s search products.