CBC Investigation Documents the Challenge for Writers: Works by Prominent Canadian Authors are Included in AI Training Data (So What Happens Now?)


The CBC, which, by the way, has just announced it will be cutting 10% of its workforce owing to reductions in the funding it receives from Parliament (cuts that unfortunately will probably curb its investigative programming), has “revealed” that works by a number of prominent Canadian authors, among them Margaret Atwood, Alice Munro, Robertson Davies, Farley Mowat and Leonard Cohen (1,200 Canadian authors in total), are included in a massive database that has been used for artificial intelligence (AI) training and research. It was the Atlantic that first reported on the extent of the unauthorized ingestion of copyrighted works into the Books3 dataset. Over 190,000 works were copied, so the 2,500 works by Canadian authors represent just over 1% of the dataset. Nonetheless, the Writers’ Union of Canada reports that 15% of its membership has had works included. This should come as no surprise given the “move fast and break things” approach of the high-tech sector, a strategy that has been demonstrated in spades during the ongoing development of generative AI.

If prominent Canadian authors are included, I can guarantee that the same situation applies to writers from any country with a strong literary tradition in addition to the US: Britain of course, Ireland, Australia, and New Zealand, not to mention European literatures. As outlined by Wired, the Books3 dataset was developed by private researchers seeking to build a database to compete with the one developed by OpenAI. CBC conducted an extensive analysis to identify the Canadian works, matching ISBN codes, writers’ names, and book titles.
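To give a sense of how that kind of cross-referencing works in practice, here is a purely hypothetical sketch: match a dataset’s metadata against a list of known works by ISBN first, falling back to normalized author/title pairs. The records, field names, and helper functions below are my own illustration, not CBC’s actual methodology or the real contents of Books3.

```python
# Hypothetical sketch of dataset cross-referencing: match by ISBN where
# available, otherwise by a normalized (author, title) pair.
# All records below are illustrative, not actual Books3 contents.

def normalize(s: str) -> str:
    """Lowercase and strip punctuation so title/name variants still match."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def find_matches(dataset, known_works):
    """Return dataset entries that match a known work by ISBN or author+title."""
    by_isbn = {w["isbn"] for w in known_works if w.get("isbn")}
    by_author_title = {
        (normalize(w["author"]), normalize(w["title"])) for w in known_works
    }
    matches = []
    for entry in dataset:
        if entry.get("isbn") in by_isbn:
            matches.append(entry)
        elif (normalize(entry["author"]), normalize(entry["title"])) in by_author_title:
            matches.append(entry)
    return matches

# Illustrative records only (invented, not real dataset rows).
dataset = [
    {"isbn": "9780771008795", "author": "Margaret Atwood", "title": "The Handmaid's Tale"},
    {"isbn": "", "author": "alice munro", "title": "Dear Life"},
    {"isbn": "9999999999999", "author": "Unknown Author", "title": "Some Other Book"},
]
known_works = [
    {"isbn": "9780771008795", "author": "Margaret Atwood", "title": "The Handmaid's Tale"},
    {"isbn": None, "author": "Alice Munro", "title": "Dear Life"},
]

print(len(find_matches(dataset, known_works)))  # → 2
```

The ISBN-first order matters: ISBNs are unambiguous identifiers, while author/title matching needs normalization to survive differences in capitalization and punctuation.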

Whether or not the use of copyrighted works to train AI programs (which in turn produce “new” works) is a violation of copyright law is still up in the air. The US Authors Guild and a number of individual authors have sued OpenAI and others for copyright infringement, although the case has not gone smoothly to date. While conceptually it is easy to understand the wrath of authors on learning that their works have been used without authorization or compensation to create “new” works that may end up competing in the market with the original, proving infringement in a court of law is still a challenge.

Lawyer and copyright blogger Neil Turkewitz has reminded us that national legislation is required to comply with international treaty commitments. Of particular relevance in considering whether copyright infringement occurs in the process of training AI on content without authorization is the “three step test” incorporated into the Berne Convention and subsequent copyright treaties, such as the TRIPS commitments of the World Trade Organization (WTO). Turkewitz points out that to be compatible with international law, exceptions to copyright protection must (1) apply only in certain special cases; (2) not conflict with a normal exploitation of a work; and (3) not unreasonably prejudice the legitimate interests of the author. He argues that it is difficult to accept that machine learning is a “special case”, or that it does not conflict with the author’s normal exploitation of the work (i.e., generating economic returns), and that the unauthorized copying that occurs therefore unreasonably prejudices (i.e., damages) the author’s legitimate interests. Although current controversies are being litigated in national courts, it is vital to keep the requirements of international law in mind.

The tech companies have taken several different positions to defend their willy-nilly scraping of copyrighted content. Many of these are outlined in submissions to the US Copyright Office as part of its current study of AI and copyright. Among the arguments advanced are:

(1) the act of reproduction falls under fair use because the output is transformative;

(2) the outputs are not derivative works based on the original content;

(3) requiring licensing of copyrighted content to train AI models would be impractical and too expensive;

(4) some other countries are allowing unauthorized ingestion of copyrighted data, so not to do so would hold back US innovation; and

(5) investors have already spent billions on developing AI models, and to upset that model now would be economically harmful.

By the way, while the tech companies are quick to dismiss the claims of rights-holders seeking to protect their content from unlicensed use in the creation of AI-generated content, they are equally quick to assert that the output of AI should merit copyright protection.

Canada, although slow off the mark, is now also examining the question of how copyright and AI intersect. The CBC investigation quoted Osgoode Hall law professor Carys Craig as stating that “it’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement…” True, but by extension this also means it is not clear that it does not. It is for that reason that Craig and others have called for a revision to Canadian copyright law to specifically allow for text and data mining (TDM). There is currently no such explicit provision in the Copyright Act and existing elements of the Act are unlikely to lend themselves to this purpose, as pointed out by Dalhousie University law professor Lucie Guibault. Worth noting is Prof. Guibault’s argument that there should be a TDM exception to enable research where the output does not compete with the original content. She uses the example of a professor of English surveying/sampling a range of writing on a particular topic in order to prepare an analysis of that genre of writing. That work would not in any way compete with the original works.

While I am not a lawyer (as I hasten to say each time I step into the minefield of venturing an opinion), it seems to me that there are a couple of key issues that need to be resolved by the courts. One is whether reproduction of an original work for training AI is itself an infringement; a second is whether that reproduction results in a derivative work based on the original, particularly one that competes with or substitutes for the original. The second point could be case-specific. For example, the unauthorized reproduction of an author’s work that is incorporated into a massive database containing tens of thousands of similar inputs, and which in turn produces a computer-generated work based on that extensive database, is less likely (it seems to me) to be infringing than the copying of a dozen works of an artist to produce a computer-generated image in the style of that artist. Put another way, to what extent can the output be tied back to the unauthorized input? The Getty Images case in the UK, where Getty is suing AI image generator Stability AI, may produce some answers given the strong correlation between Getty’s original images and Stable Diffusion’s output images. In some instances, the latter even reproduced Getty’s watermark.

While no one has a crystal ball to determine where all this will end up as a result of the current court cases in the US and Britain, I will venture one more opinion. There will be no bright line emerging from the litigation. The results will be mixed, possibly even contradictory, as sometimes happens between different circuits of the US justice system. With no clear carte blanche for the tech industry, a degree of financial and legal hazard will hang over the development of various applications. Equally, authors probably won’t get a definitive and unqualified ruling that will stop unauthorized and unlicensed use in its tracks. What will be the result?

The outcome will likely be a thriving marketplace for licensing content, as we are already beginning to see. Companies with proprietary databases, like Getty Images, the New York Times, Reuters, Bloomberg and so on, will not only strengthen their ability to block webcrawlers but will also expand their licensing activities. Already the Associated Press has a deal with OpenAI, and other creators of content, including Twitter (X), are moving to the licensing model. It provides a revenue stream for creators and gives AI developers a way to avoid litigation. The certainty of a licence will far outweigh the risk, uncertainty, and potential cost of not having one.

However, it is one thing for large organizations like the Associated Press, with established databases, to enter into negotiations with AI developers to license content, but what about all those disparate books and authors out there whose work is being used without permission? AI developers will say that it is impossible to contact, let alone negotiate with, all those rights-holders, thus creating an insurmountable barrier to innovation. Yet we already have structures to draw on that might help solve the problem.

Copyright collectives perform similar tasks when dealing with educational users, issuing licences and compensating rights-holders. Could organizations like the Copyright Clearance Center in the US, Access Copyright in Canada, the Copyright Licensing Agency in the UK and similar bodies elsewhere help solve the problem? Admittedly, the inclusion of one or two works in a very large training set might not yield much revenue, but a start has to be made somewhere. If over 40 of Margaret Atwood’s works are in the Books3 database, and 75% of the books featured on the popular CBC program Canada Reads are included as well, there is a problem that needs to be fixed. Of course, it would be up to rights-holders to decide whether to empower collectives to manage their rights, and the collectives would need to decide whether they have the capacity, or whether it is worth their effort, to do so.
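To make the collective-licensing idea concrete, here is a purely hypothetical sketch of the arithmetic a collective might use: split a single licence fee pro rata by the number of each rights-holder’s works included in a training set. The fee, author names, and work counts are invented for illustration; real collectives use far more elaborate distribution rules.

```python
# Hypothetical pro-rata split of one licence fee by number of works included.
# All figures and names are invented for illustration only.

def split_fee(total_fee_cents: int, works_per_author: dict) -> dict:
    """Split a licence fee in proportion to each author's share of included works."""
    total_works = sum(works_per_author.values())
    shares = {
        author: total_fee_cents * count // total_works
        for author, count in works_per_author.items()
    }
    # Integer division can leave a few cents over; assign the remainder
    # to the largest contributor so the shares sum exactly to the fee.
    remainder = total_fee_cents - sum(shares.values())
    top = max(works_per_author, key=works_per_author.get)
    shares[top] += remainder
    return shares

payout = split_fee(
    total_fee_cents=1_000_000,  # a $10,000 licence fee, in cents
    works_per_author={"Author A": 40, "Author B": 8, "Author C": 2},
)
print(payout)  # → {'Author A': 800000, 'Author B': 160000, 'Author C': 40000}
```

Working in cents and reconciling the rounding remainder keeps the distribution exact, which matters once the same split is run across thousands of rights-holders.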

Meanwhile, the consultation process in the US, Canada and elsewhere grinds on. As the interests of various stakeholders are weighed, there is unlikely to be a silver bullet that resolves the issue to the satisfaction of all. In the end, some form of compromise will likely emerge from the process. It will, I hope, continue to respect the basic principles of copyright while adapting it to the age of AI.

© Hugh Stephens, 2023. All Rights Reserved.

Author: hughstephensblog

I am a former Canadian foreign service officer and a retired executive with Time Warner. In both capacities I worked for many years in Asia. I have been writing this copyright blog since 2016, and recently published a book, "In Defence of Copyright", to raise awareness of the importance of good copyright protection in Canada and globally. It is written from and for the layman's perspective (it is not a legal text or scholarly work) and is illustrated with some of the unusual copyright stories drawn from the blog. Available on Amazon and at local bookstores.
