You know what’s unexpectedly awesome? Digitizing old newspapers. I’ve been knee-deep in this project for the Sapulpa Herald, and let me tell you, it’s been a blast. There’s something magical about diving into the history of a town, watching decades unfold through the lens of local journalism.
The Sapulpa Herald isn’t just any paper – it’s the longest-running newspaper in Sapulpa, Oklahoma. These folks have been the voice of their community for over a century, chronicling the everyday stories that make up the fabric of small-town life. But here’s the thing: while their archives on newspapers.com were a treasure trove of history, their own website was… well, let’s just say it needed some love.
That’s where I came in. Tasked with bringing the Herald into the digital age, I found myself combing through countless digital clippings on newspapers.com. Each article was a window to the past, filled with stories waiting to be shared with a new generation of readers. The challenge? Finding a way to transform these archived pages into searchable, accessible content on the Herald’s website without losing the charm and character of the original stories.
At first, I was manually typing out each article, savoring every word but realizing that at this rate, I’d be retired before I finished the job. That’s when I decided to try out a few AI tools to see if they could help.
ChatGPT
I started with ChatGPT. While it’s useful for many tasks, its performance in transcribing old news articles was inconsistent. It was generally accurate, but it had some notable issues:
- It occasionally inserted bracketed words that weren’t present in the original text.
- Some words were changed without apparent reason.
For example, it changed the phrase “the same person turned in the three calls” to “the same person turned in the three [false] alarms.” An alteration like this looks minor, but it can change the meaning or context of a historical record.
Claude
Next, I tested Claude, my favorite AI language model. Its performance was mostly accurate, but it still had a few small discrepancies:
- Some words were changed without apparent reason.
- There were occasional issues with numerical values.
Specific examples of these discrepancies include:
- “time out for gas” was changed to “filling out for gas”
- “169 miles” was changed to “150 miles”
- “Miles traveled during the run” was changed to “miles driven during the test”
- “performed” was changed to “functioned”
While these alterations were minor, they demonstrate the importance of careful proofreading when using AI for historical document transcription.
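If you want to automate part of that proofreading, here is a minimal sketch in Python. It assumes you have a hand-typed version of the same article to compare against (for a spot check on a sample, say). The function name `word_diffs` and the sample sentences are made up for illustration; only the “169 miles” and “performed” substitutions come from the examples above.

```python
import difflib


def word_diffs(reference: str, transcript: str) -> list[tuple[str, str]]:
    """Return (reference_words, transcript_words) pairs wherever the texts differ."""
    ref_words = reference.split()
    out_words = transcript.split()
    # autojunk=False keeps common words from being ignored in long articles.
    matcher = difflib.SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2])))
    return diffs


# Hypothetical sample sentences built around two substitutions mentioned above.
hand_typed = "the car covered 169 miles and performed well"
ai_output = "the car covered 150 miles and functioned well"

for original, changed in word_diffs(hand_typed, ai_output):
    print(f"original: {original!r}  ->  AI: {changed!r}")
```

A comparison like this only flags where two versions disagree. It can't tell you which one is right, so a human still has to check each flagged spot against the original clipping.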
Gemini
Finally, I tested Gemini. This AI tool demonstrated significantly better performance:
- It consistently produced near-perfect extractions, at what I’d estimate to be 99.999% accuracy.
- The extractions maintained the original wording and formatting.
However, there is an important limitation to note:
- Gemini has a maximum output of 8192 tokens (approximately 6000 words or 30,000 characters).
- Attempting to process text beyond this limit can lead to decreased accuracy.
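One way to stay under that limit is to split long text into chunks before sending it. The sketch below is only an illustration. It assumes the article is already available as plain text with blank lines between paragraphs, and it uses a character budget instead of a real token count. The function name `chunk_paragraphs` and the 24,000-character default (a safety margin below the rough 30,000-character figure above) are my own assumptions, not anything from Gemini’s documentation.

```python
def chunk_paragraphs(text: str, max_chars: int = 24_000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own
    (oversized) chunk instead of being cut mid-sentence.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # The +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny demonstration with an artificially small budget.
sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
pieces = chunk_paragraphs(sample, max_chars=25)
print([len(p) for p in pieces])
```

Splitting at paragraph boundaries keeps each chunk readable on its own, which makes it easier to proofread the output against the original clipping.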
There was one instance where Gemini switched to a narrow columnar format with hyphenated words, but this was easily corrected and did not recur in subsequent tests.
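If narrow-column output like that does turn up, a small cleanup script can undo it. This is a hypothetical sketch, not a record of how the fix was actually made here. The function name `unwrap_columns` is made up, and it assumes that a hyphen at the end of a line always marks a split word. That assumption is wrong for genuine compounds that happen to break at a line end, so the result still needs a read-through.

```python
import re


def unwrap_columns(text: str) -> str:
    """Rejoin line-end hyphenated words and unwrap single line breaks."""
    # "news-\npaper" -> "newspaper"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace lone line breaks with spaces; blank lines (paragraph breaks) stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


# Hypothetical sample text, not taken from an actual article.
columnar = "The news-\npaper was\nprinted daily.\n\nNext item."
print(unwrap_columns(columnar))
```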
I’ve been using Gemini for several weeks now for our digitization project at the Sapulpa Herald. Its high accuracy within the token limit has significantly improved our workflow and the quality of our digital archives.
One of My Favorite Projects
Watching these old stories come to life online has been a blast. It’s like we’re building a digital time machine for Sapulpa. People can now dive into their town’s history with just a few clicks. And let me tell you, seeing folks connect with their past through these old articles? That’s the real payoff.
So, if you ever find yourself drowning in a sea of old text that needs digitizing, give Gemini a shot. Just remember – 8192 tokens is its magic number. Stay within that, and you’ll be golden.
Who knows? Maybe one day we’ll have AI that can handle entire newspaper archives in one go. But until then, I’ll be here, feeding Gemini one chunk of Sapulpa history at a time, and watching our digital archive grow. And honestly? I wouldn’t have it any other way.