Much of the conversation about AI in journalism centers on the practice and ethics of using these new tools to generate news or article content. As a hyperlocal news organization, we’ve not yet found any place where a large language model like ChatGPT can do more or better than our human staff when it comes to researching, reporting and assembling our publication. I’ll let you know if that changes.
The one area where I think it will have practical applications for us is in extracting key bits of data and other information from unstructured documents and communications, and bringing them into a structured format that we can use in our reporting and that will save us time.
As a software developer, I’m historically used to writing very specialized functions that try to do this exact thing, and it can be a painful process. “Get this one dollar amount from this one PDF file, it will usually appear on page 2 but sometimes on page 4 unless this one other bit of text is present in which case it will be on page 6.” Oh, the hours one can spend weaving together endless lines of code that still don’t always work, just to save a few minutes of human time.
But AI models and interfaces can replace those one-off, fragile functions, and without a lot of time invested in the initial setup.
Here’s the first example of this that we’ve put into production:
Every week we publish death notices, a short list of recent deaths collected from obituaries, public records and incoming communications to our office. We offer this as a public service to complement our paid obituary placement offering.
To put these together, we need six pieces of information about the deceased:
- First name(s)
- Last name
- Age
- City and state of residence
- Date of death
- Funeral home name
We already have a tool in place that automatically scans local funeral home websites and extracts obituary information into a database. But different obituary websites offer different levels of detail with different levels of structure. At some point you hit diminishing returns in crafting software that tries to get the above values from those sites.
So we’ve augmented that process with an automated request to OpenAI’s text-davinci-003 model to “read” each full obituary we detect and try to extract the information we need.
Here’s the prompt we send to the text-davinci-003 API endpoint:
Given an obituary excerpt, return only a JSON response with six fields and their values: ‘deathDate’, the date of death, using 2023 if no year is specified, formatted as ‘YYYY-MM-DD’, ‘age’, the person’s age as an integer, ‘location’, the city and state where the person resided before dying formatted as ‘city, state’ using two-letter state code, defaulting to the state of Indiana if none is specified, ‘lastName’, the person’s last name, ‘firstNames’, the person’s formal, unabbreviated first and middle names, and ‘confidence’, the percentage confidence that you have extracted the correct values as an integer between 0-100. The excerpt: …
And then we give it the first 100 words of the obituary text itself.
(I am not a professional prompt engineer, so I’m sure this could use some fine-tuning. And no, the year is not hardcoded, but I simplified the above for this post.)
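For the curious, here is a minimal Python sketch of how the prompt assembly described above might look. The function names are hypothetical, filling in the current year is just one plausible way to avoid hardcoding it, and the `max_tokens` and `temperature` values are assumptions rather than settings taken from this post. The sketch only builds the request body for the completions endpoint; it does not send anything.

```python
from datetime import date

# The prompt quoted above, with the year turned into a placeholder.
PROMPT_TEMPLATE = (
    "Given an obituary excerpt, return only a JSON response with six fields "
    "and their values: 'deathDate', the date of death, using {year} if no year "
    "is specified, formatted as 'YYYY-MM-DD', 'age', the person's age as an "
    "integer, 'location', the city and state where the person resided before "
    "dying formatted as 'city, state' using two-letter state code, defaulting "
    "to the state of Indiana if none is specified, 'lastName', the person's "
    "last name, 'firstNames', the person's formal, unabbreviated first and "
    "middle names, and 'confidence', the percentage confidence that you have "
    "extracted the correct values as an integer between 0-100. The excerpt: "
)


def first_words(text, limit=100):
    """Return the first `limit` whitespace-separated words of `text`."""
    return " ".join(text.split()[:limit])


def build_prompt(obituary_text, year=None):
    """Combine the instructions with the first 100 words of the obituary."""
    year = year or date.today().year  # assumption: default to the current year
    return PROMPT_TEMPLATE.format(year=year) + first_words(obituary_text)


def build_request(obituary_text, year=None):
    """Build a JSON-serializable body for a completions API request."""
    return {
        "model": "text-davinci-003",
        "prompt": build_prompt(obituary_text, year),
        "max_tokens": 200,  # assumption: enough room for a small JSON object
        "temperature": 0,   # assumption: favor repeatable output over variety
    }
```

Capping the excerpt at 100 words keeps each request small, which matters when the cost is billed by the amount of text sent and received.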
The result we get back from our request looks like this:
"location": "Richmond, IN",
"firstNames": "Joseph John",
As a structured data response, we can use that in our software tool to fill in some of the fields we weren’t able to detect in our own scanning of the obituary. (If the confidence is below 95% or if any of the fields seem missing or off, we don’t use the result and instead flag it for further human review.)
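The acceptance check described above could be sketched roughly like this in Python. The function name and the exact rejection reasons are hypothetical, and "seems missing or off" is interpreted narrowly here as invalid JSON, an empty required field, or a date that does not parse; a real check might look at more than that.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = (
    "deathDate", "age", "location", "lastName", "firstNames", "confidence",
)
MIN_CONFIDENCE = 95  # threshold mentioned in the post


def review_extraction(raw_response):
    """Return (fields, None) if the result looks usable, else (None, reason).

    A non-None reason means the obituary should be flagged for human review.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(data, dict):
        return None, "not a JSON object"

    # Every requested field must be present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if data.get(f) in (None, "")]
    if missing:
        return None, "missing fields: " + ", ".join(missing)

    # Reject anything the model itself is not confident about.
    confidence = data["confidence"]
    if not isinstance(confidence, int) or confidence < MIN_CONFIDENCE:
        return None, "confidence below threshold"

    # The date must parse in the format the prompt asked for.
    try:
        datetime.strptime(data["deathDate"], "%Y-%m-%d")
    except (TypeError, ValueError):
        return None, "deathDate is not YYYY-MM-DD"

    return data, None
```

Returning a reason alongside the rejection makes it easier for whoever does the human review to see why a particular obituary was flagged.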
From there, when it’s time to place the death notices feature on the printed page file in InDesign, we have a web interface that takes the accumulated database of auto-scanned obituaries and formats them nicely into a chunk of text that’s ready for copy/paste. Obviously we could also direct that same information to a web page or other destinations.
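As a rough illustration of that formatting step, here is a small Python sketch that turns accumulated records into a block of copy/paste-ready text. The layout of each notice, the alphabetical ordering, the `funeralHome` key and the function names are all assumptions made for the example; the post does not specify the actual output format.

```python
from datetime import date


def format_notice(record):
    """Format one record as a single line of text (hypothetical layout)."""
    died = date.fromisoformat(record["deathDate"])
    died_text = f"{died.strftime('%B')} {died.day}, {died.year}"
    return (
        f"{record['lastName']}, {record['firstNames']}, {record['age']}, "
        f"{record['location']}, died {died_text}. {record['funeralHome']}."
    )


def format_notices(records):
    """Sort records by name and join them into one block of text."""
    ordered = sorted(records, key=lambda r: (r["lastName"], r["firstNames"]))
    return "\n".join(format_notice(r) for r in ordered)
```

Because the records are already structured, pointing the same data at a web page or another destination would mostly mean swapping out `format_notice` for a different template.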
For an average of 32 death notices augmented by AI completion per weekly newspaper issue, this use of OpenAI is expected to cost us around $15 per year. Compared to the cost of a human’s time to extract the same information by reading obituaries, that’s a pretty good price point.
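To put those numbers in per-notice terms, here is the arithmetic as a short Python sketch. It assumes 52 issues per year, and the token estimate assumes a price of $0.02 per 1,000 tokens for text-davinci-003, which is not stated in this post and should be checked against current pricing.

```python
NOTICES_PER_ISSUE = 32       # average from the post
ISSUES_PER_YEAR = 52         # assumption: one issue every week of the year
ANNUAL_COST_USD = 15.00      # estimate from the post
PRICE_PER_1K_TOKENS = 0.02   # assumption: text-davinci-003 list price in USD

requests_per_year = NOTICES_PER_ISSUE * ISSUES_PER_YEAR  # 1664
cost_per_notice = ANNUAL_COST_USD / requests_per_year

# Rough token budget per request implied by the figures above
# (prompt plus response, since both are billed).
implied_tokens_per_request = cost_per_notice / PRICE_PER_1K_TOKENS * 1000

print(f"Requests per year: {requests_per_year}")
print(f"Cost per notice: ${cost_per_notice:.4f}")
print(f"Implied tokens per request: {implied_tokens_per_request:.0f}")
```

Under those assumptions, each notice costs a bit less than one cent, which is consistent with a prompt of a few hundred tokens plus a short JSON reply.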
We’ve created a statement on our use of AI in our publishing activities, but the ethical implications for news gathering feel minimal here: the source material for each AI analysis is singular and clear versus relying on the black box of a language model’s training materials, the impact the AI has on to-be-published content is limited to a few clearly defined data fields, and a human still reviews and edits the result before publishing.
Still, it’s important to think through the longer term implications of this tool and its future iterations. We’re creating what might end up for some being the only accessible, printed record of real humans who lived and died in our community, and that’s not something we want to mess up or take lightly.