Posts about LLMs

Journalism and AI

Here are are my written remarks for a hearing on AI and the future of journalism for the Senate Judiciary Subcommittee on Privacy, Technology, and the Law, on January 10, 2024.

I have been a journalist for fifty years and a journalism professor for the last eighteen.

  1. History

I would like to begin with three lessons on the history of news and copyright, which I learned researching my book, The Gutenberg Parenthesis: The Age of Print and its Lessons for the Age of the Internet (Bloomsbury, 2023):

First, America’s 1790 Copyright Act covered only charts, maps, and books. The New York Times’ suit against OpenAI claims that, “Since our nation’s founding, strong copyright protection has empowered those who gather and report news to secure the fruits of their labor and investment.” In truth, newspapers were not covered in the statute until 1909 and even then, according to Will Slauter, author of Who Owns the News: A History of Copyright (Stanford, 2019), there was debate over whether to include news articles, for they were the products of the institution more than an author. 

Second, the Post Office Act of 1792 allowed newspapers to exchange copies for free, enabling journalists with the literal title of “scissors editor” to copy and reprint each others’ articles, with the explicit intent to create a network for news, and with it a nation. 

Third, exactly a century ago, when print media faced their first competitor — radio — newspapers were hostile in their reception. Publishers strong-armed broadcasters into signing the  1933 Biltmore Agreement by threatening not to print program listings. The agreement limited radio to two news updates a day, without advertising; required radio to buy their news from newspapers’ wire services; and even forbade on-air commentators from discussing any event until twelve hours afterwards — a so-called “hot news doctrine,” which the Associated Press has since tried to resurrect. Newspapers lobbied to keep radio reporters out of the Congressional press galleries. They also lobbied for radio to be regulated, carving an exception to the First Amendment’s protections of freedom of expression and the press. 

Publishers accused radio — just as they have since accused television and the internet and AI — of stealing “their” content, audience, and revenue, as if each had been granted them by royal privilege. In scholar Gwenyth Jackaway’s words, publishers “warned that the values of democracy and the survival of our political system” would be endangered by radio. That sounds much like the sacred rhetoric in The Times’ OpenAI suit: “Independent journalism is vital to our democracy. It is also increasingly rare and valuable.” 

To this day, journalists — whether on radio or at The New York Times — read, learn from, and repurpose facts and knowledge gained from the work of fellow journalists. Without that assured freedom, newspapers and news on television and radio and online could not function. The real question at hand is whether artificial intelligence should have the same right that journalists and we all have: the right to read, the right to learn, the right to use information once known. If it is deprived of such rights, what might we lose?

  1. Opportunities

Rather than dwelling on a battle of old technology and titans versus new, I prefer to focus here on the good that might come from news collaborating with this new technology. 

First, though, a caveat: I argue it is irresponsible to use large language models where facts matter, for we know that LLMs have no sense of fact; they only predict words. News companies, including CNET, G/O Media, and Gannett, have misstepped, using the technology to manufacture articles at scale, strewn with errors. I covered the show-cause hearing for a New York attorney who (like President Trump’s former counsel, Michael Cohen) used an LLM to list case citations. Federal District Judge P. Kevin Castel made clear that the problem was not the technology but its misuse by humans. Lawyers and journalists alike must exercise caution in using generative AI to do their work. 

Having said that, AI presents many intriguing possibilities for news and media. For example:

AI has proven to be excellent at translation. News organizations could use it to present their news internationally.

Large language models are good at summarizing a limited corpus of text. This is what Google’s NotebookLM does, helping writers organize their research. 

AI can analyze more text than any one reporter. I brainstormed with an editor about having citizens record 100 school-board meetings so the technology could transcribe them and then answer questions about how many boards are discussing, say, banning books. 

I am fascinated with the idea that AI could extend literacy, helping people who are intimidated by writing tell and illustrate their own stories.

A task force of academics from the Modern Language Association concluded AI in the classroom could help students with word play, analyzing writing styles, overcoming writers’ block, and stimulating discussion. 

AI also enables anyone to write computer code. As an AI executive told me in a podcast about AI that I cohost, “English majors are taking the world back… The hottest programming language on planet Earth right now is English.” 

Because LLMs are in essence a concordance of all available language online, I hope to see scholars examine them to study society’s biases and clichés.

And I see opportunities for publishers to put large language models in front of their content to allow readers to enter into dialog with that content, asking their own questions and creating new subscription benefits. I know an entrepreneur who is building such a business. 

Note that in Norway, the country’s largest and most prestigious publisher, Schibsted, is leading the way to build a Norwegian-language large language model and is urging all publishers to contribute content. In the US, Aimee Rinehart, an executive student of mine at CUNY who works on AI at the Associated Press, is also studying the possibility of an LLM for the news industry. 

  1. Risks

All these opportunities and more are put at risk if we fence off the open internet into private fortresses.

Common Crawl is a foundation that for sixteen years has archived the entire web: 250 billion pages, 10 petabytes of text made available to scholars for free, yielding 10,000 research papers. I am disturbed to learn that The New York Times has demanded that the entire history of its content — that which was freely available — be erased. Personally, when I learned that my books were included in the Books3 data set used to train large language models, I was delighted, for I write not only to make money but also to spread ideas. 

What happens to our information ecosystem when all authoritative news retreats behind paywalls, available only to privileged citizens and giant corporations able to pay for it? What happens to our democracy when all that is left out in public for free — to inform both citizens and machines — is propaganda, disinformation, conspiracies, spam, and lies? I well understand the economic plight of my industry, for I direct a Center for Entrepreneurial Journalism. But I also say we must have a discussion about journalism’s moral obligation to an informed society and about the right not only to speak but to learn.

  1. Copyright

And we need to talk about reimaging copyright in this age of change, starting with a discussion about generative AI as fair and transformative use. When the Copyright Office sought opinions on artificial intelligence and copyright (Docket 2023-6), I responded with concern about an idea the Office raised of establishing compulsory licensing schemes for training data. Technology companies already offer simple opt-out mechanisms (see: robots.TXT).

Copyright at its origin in the Statute of Anne of 1710 was enacted not to protect creators, as is commonly asserted. Instead, it was passed at the demand of booksellers and publishers to establish a marketplace for creativity as a tradeable asset. Our concepts of creativity-as-content and content-as-property have their roots in copyright. 

Now along come machines — large language models and generative AI — that manufacture endless content. University of Maryland Professor Matthew Kirschenbaum warns of what he calls “the Textpocalypse.” Artificial intelligence commodifies the idea of content, even devalues it. I welcome this. For I hope it might drive journalists to understand that their value is not in manufacturing the commodity, content. Instead, they must see journalism as a service to help citizens inform public discourse and improve their communities. 

In 2012, I led a series of discussions with multiple stakeholders — media executives, creative artists, policymakers — for a project with the World Economic Forum on rethinking intellectual property and the support of creativity in the digital age. In the safe space of Davos, even media executives would concede that copyright is outmoded. Out of this work, I conceived of a framework I call “creditright,” which I’ve written is “the right to receive credit for contributions to a chain of collaborative inspiration, creation, and recommendation of creative work. Creditright would permit the behaviors we want to encourage to be recognized and rewarded. Those behaviors might include inspiring a work, creating that work, remixing it, collaborating in it, performing it, promoting it. The rewards might be payment or merely credit as its own reward.” It is just one idea, intended to spark discussion. 

Publishers constantly try to extend copyright’s restrictions in their favor, arguing that platforms owe them the advertising revenue they lost when their customers fled for better, competitive deals online. This began in 2013 with German publishers lobbying for a Leistungsschutzrecht, or ancillary copyright, which inspired further protectionist legislation, including Spain’s link tax, articles 15 and 17 of the EU’s Copyright Directive, Australia’s News Media Bargaining Code, and most recently Canada’s Bill C-18, which requires large platforms — namely Google and Facebook — to negotiate with publishers for the right to link to their news. To gain an exemption from the law, Google agreed to pay about $75 million to publishers — generous, but hardly enough to save the industry. Meta decided instead to take down links to news rather than being forced to pay to link. That is Meta’s right under Canada’s Charter of Rights and Freedoms, for compelled speech is not free speech. 

In this process, lobbyists for Canada’s publishers insisted that their headlines were valuable while Meta’s links were not. The nonmarket intervention of C-18 sided with the publishers. But as it turned out, when those links disappeared, Facebook lost no traffic while publishers lost up to a third of theirs. The market spoke: Links are valuable. Legislation to restrict linking would break the internet for all. 

I fear that the proposed Journalism Competition and Preservation Act (JCPA) and the California Journalism Protection Act (CJPA) could have similar effect here. As a journalist, I must say that I am offended to see publishers lobby for protectionist legislation, trading on the political capital earned through journalism. The news should remain independent of — not beholden to — the public officials it covers. I worry that publishers will attempt to extend copyright to their benefit not only with search and social platforms but now with AI companies, disadvantaging new and small competitors in an act of regulatory capture. 

  1. Support for innovation

The answer for both technology and journalism is to support innovation. That means enabling open-source development, encouraging both AI models and data — such as that offered by Common Crawl — to be shared freely. 

Rather than protecting the big, old newspaper chains — many of them now controlled by hedge funds, which will not invest or innovate in news — it is better to nurture new competition. Take, for example, the 450 members of the New Jersey News Commons, which I helped start a decade ago at Montclair State University; and the 475 members of the Local Independent Online News Publishers; the 425 members of the Institute for Nonprofit News; and the 4,000 members of the News Product Alliance, which I also helped start at CUNY. This is where innovation in news is occurring: bottom-up, grass-roots efforts emergent from communities. 

There are many movements to rebuild journalism. I helped develop one: a degree program called Engagement Journalism. Others include Solutions Journalism, Constructive Journalism, Reparative Journalism, Dialog Journalism, and Collaborative Journalism. What they share is an ethic of first listening to communities and their needs. 

In my upcoming book, The Web We Weave, I ask technologists, scholars, media, users, and governments to enter into covenants of mutual obligation for the future of the internet and, by extension, AI. 

There I propose that you, as government, promise first to protect the rights of speech and assembly made possible by the internet. Base decisions that affect internet rights on rational proof of harms, not protectionism for threatened industries and not media’s moral panic. Do not splinter the internet along national borders. And encourage and enable new competition and openness rather than entrenching incumbent interests through regulatory capture. 

In short, I seek a Hippocratic Oath for the internet: First, do no harm.

A few unpopular opinions about AI

In a conversation with Jason Howell for his upcoming AI podcast on the TWiT network, I came to wonder whether ChatGPT and large language models might give all of artificial intelligence cultural cooties, for the technology is being misused by companies and miscast by media such that the public may come to wonder whether they can ever trust the output of a machine. That is the disaster scenario the AI boys do not account for.

While AI’s boys are busy thumping their chests about their power to annihilate humanity, if they are not careful — and they are not — generative AI could come to be distrusted for misleading users (the companies’ fault more than the machine’s); filling our already messy information ecosystem with the data equivalent of Styrofoam peanuts and junk mail; making news worse; making customer service even worse; making education worse; threatening jobs; and hurting the environment. What’s not to dislike?

Below I will share my likely unpopular opinions about large language models — how they should not be used in search or news, how building effective guardrails is improbable, how we already have enough fucking content in the world. But first, a few caveats:

I do see limited potential uses for synthetic text and generative AI. Watch this excellent talk by Emily Bender, one of the authors of the seminal Stochastic Parrots paper and a leading critic of AI hype, suggesting criteria for acceptable applications: cases where language form and fluency matter but facts do not (e.g., foreign language instruction), where bias can be filtered, and where originality is not required.

Here I explored the idea that large language models could help extend literacy for those who are intimidated by writing and thus excluded from discourse. I am impressed with Google’s NotebookLM (which I’ve seen thanks to Steven Johnson, its editorial director), as an augmentative tool designed not to create content but to help writers organize research and enter into dialog with text (a possible new model for interaction with news, by the way). Gutenberg can be blamed for giving birth to the drudgery of bureaucracy and perhaps LLMs can save us some of the grind of responding to it.

I value much of what machine learning makes possible today — in, for example, Google’s Search, Translate, Maps, Assistant, and autocomplete. I am a defender of the internet (subject of my next book) and, yes, social media. Yet I am cautious about this latest AI flavor of the month, not because generative AI itself is dangerous but because the uses to which it is being put are stupid and its current proprietors are worrisome.

So here are a few of my unpopular opinions about large language models like ChatGPT:

It is irresponsible to use generative AI models as presently constituted in search or anywhere users are conditioned to expect facts and truthful responses. Presented with the empty box on Bing’s or Google’s search engines, one expects at least a credible list of sites relevant to one’s query, or a direct response based on a trusted source: Wikipedia or services providing the weather, stock prices, or sports scores. To have an LLM generate a response — knowing full well that the program has no understanding of fact — is simply wrong.

No news organization should use generative AI to write news stories, except in very circumscribed circumstances. For years now, wire services have used artificial intelligence software to generate simple news stories from limited, verified, and highly structured data — finance, sports, weather — and that works because of the strictly bounded arena in which such programs work. Using LLMs trained on the entire web to generate news stories from the ether is irresponsible, for it only predicts words, it cannot discern facts, and it reflects biases. I endorse experimenting with AI to augment journalists’ work, organizing information or analyzing data. Otherwise, stay away.

The last thing the world needs is more content. This, too, we can blame on Gutenberg (and I do, in The Gutenberg Parenthesis), for printing brought about the commodification of conversation and creativity as a product we call content. Journalists and other writers came to believe that their value resides entirely in content, rather than in the higher, human concepts of service and relationships. So my industry, at its most industrial, thinks its mission is to extrude ever more content. The business model encourages that: more stuff to fill more pages to get more clicks and more attention and a few more ad pennies. And now comes AI, able to manufacture no end of stuff. No. Tell the machine to STFU.

There will be no way to build foolproof guardrails against people making AI do bad things. We regularly see news articles reporting that an LLM lied about — even libeled — someone. First note well that LLMs do not lie or hallucinate because they have no conception of truth or meaning. Thus they can be made to say anything about anyone. The only limit on such behavior is the developers’ ability to predict and forbid everything bad that anyone could do with the software. (See, for example, how ChatGPT at first refused to go where The New York Times’ Kevin Roose wanted it to go and even scolded him for trying to draw out its dark side. But Roose persevered and led it astray anyway.) No policy, no statute, no regulation, no code can prevent this. So what do we do? We try to hold accountable the user who gets the machine to say bad shit and then spread it, just as we would if you printed out nasty shit on your HP printer and posted it around the neighborhood. Not much else we can do.

AI will not ruin democracy. We see regular alarms that AI will produce so much disinformation that democracy is in peril — see a recent warning from Jon Naughton of The Guardian that “a tsunami of AI misinformation will shape next year’s knife-edge elections.” But hold on. First, we already have more than enough misinformation; who’s to say that any more will make a difference? Second, research finds again and again that online disinformation played a small role in the 2016 election. We have bigger problems to address about the willful credulity of those who want to signal their hatreds with misinformation and we should not let tropes of techno moral panic distract us from that greater peril.

Perhaps LLMs should have been introduced as fiction machines. ChatGPT is a nice parlor trick, no doubt. It can make shit up. It can sound like us. Cool. If that entertaining power were used to write short stories or songs or poems and if it were clearly understood that the machine could do little else, I’m not sure we’d be in our current dither about AI. Problem is, as any novelist or songwriter or poet can tell you, there’s little money in creativity anymore. That wouldn’t attract billions in venture capital and the stratospheric valuations that go with it whenever AI is associated with internet search, media, and McKinsey finding a new way to kill jobs. As with so much else today, the problem isn’t with the tool or the user but with capitalism. (To those who would correct me and say it’s late-stage capitalism, I respond: How can you be so sure it is in its last stages?)

Training artificial intelligence models on existing content could be considered fair use. Their output is generally transformative. If that is true, then training machines on content would not be a violation of copyright or theft. It will take years for courts to adjudicate the implications of generative AI on outmoded copyright doctrine and law. As Harvard Law Professor Lawrence Lessig famously said, fair use is the right to hire an attorney. Media moguls are rushing to do just that, hiring lawyers to force AI companies to pay for the right to use news content to train their machines — just as the publishers paid lobbyists to get legislators to pass laws to get search engines and social media platforms to pay to link to news content. (See how well that’s working out in Canada.) I am no lawyer but I believe training machines on any content that is lawfully acquired so it can be inspired to produce new content is not a violation of copyright. Note my italics.

Machines should have the same right to learn as humans; to say otherwise is to set a dangerous precedent for humans. If we say that a machine is not allowed to learn, to read, to extract knowledge from existing content and adapt it to other uses, then I fear it would not be a long leap to declare what we as humans are not allowed to read, see, or know some things. This puts us in the odd position of having to defend the machine’s rights so as to protect our own.

Stopping large language models from having access to quality content will make them even worse. Same problem we have in our democracy: Pay walls restrict quality information to the already rich and powerful, leaving the field — whether that is news or democracy or machine learning — free to bad actors and their disinformation.

Does the product of the machine deserve copyright protection? I’m not sure. A federal court just upheld the US Copyright Office’s refusal to grant copyright protection to the product of AI. I’m just as happy as the next copyright revolutionary to see the old doctrine fenced in for the sake of a larger commons. But the agency’s ruling was limited to content generated solely by the machine and in most cases (in fact, all cases) people are involved. So I’m not sure where we will end up. The bottom line is that we need a wholesale reconsideration of copyright (which I also address in The Gutenberg Parenthesis). Odds of that happening? About as high as the odds that AI will destroy mankind.

The most dangerous prospect arising from the current generation of AI is not the technology, but the philosophy espoused by some of its technologists. I won’t venture deep down this rat hole now, but the faux philosophies espoused by many of the AI boys — in the acronym of Émile Torres and Timnit Gebru, TESCREAL, or longtermism for short — is noxious and frightening, serving as self-justification for their wealth and power. Their philosophizing might add up to a glib freshman’s essay on utilitarianism if it did not also border on eugenics and if these boys did not have the wealth and power they wield. See Torres’ excellent reporting on TESCREAL here. Media should be paying attention to this angle instead of acting as the boys’ fawning stenographers. They must bring the voices of responsible scholars — from many fields, including the humanities — into the discussion. And government should encourage truly open-source development and investment to bring on competitors that can keep these boys, more than their machines, in check.