AI helps add 10k more photos to OldNYC
Tags: oldnyc, openai, gpt-4o, historical preservation, digital archives, geolocation, ocr, openstreetmap, new york city history, ai, maplibre, new york public library


How OldNYC Found 10,000 "Lost" Photos

OldNYC recently expanded its collection by 10,000 historic photos, bringing its total from 39,000 (in 2016) to 49,000 (in 2026). However, a significant hurdle remained: many of these newly added images, and thousands already in the archive, lacked precise location data. This is where AI offered a powerful solution to make these photos truly accessible.

In 2024, the OldNYC team deployed OpenAI's gpt-4o to extract location details from existing image descriptions. The model wasn't "seeing" the photos; it was reading the accompanying text, often vague, and pulling out geographical clues: street names, intersections, and landmarks.

This approach helped them locate roughly 6,000 additional photos, significantly enhancing the discoverability of both new and existing entries. Now, OldNYC can pinpoint about 87% of its photos with usable location data, and an impressive 96% of those mapped images land in the correct spot. This improved historical accuracy by enabling researchers to cross-reference locations with greater confidence, and made the archive more accessible by allowing users to browse geographically, rather than solely by keyword.

Mapping the Past

Locating these images involved more than just AI reading text; it required a robust mapping infrastructure. OldNYC replaced the Google Maps Geocoding API with a combination of OpenStreetMap and specialized historical street datasets. This includes crucial data from the New York Public Library's historical streets project, essential for accurately placing photos at intersections that have changed names or even vanished over time.
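Resolving a historical intersection means handling streets that have been renamed or erased. A toy sketch of that lookup, with a hypothetical alias table and gazetteer (the entries and coordinates below are illustrative, not drawn from OldNYC's or the NYPL's actual datasets):

```python
# Map old street names to their modern equivalents (illustrative entries).
HISTORICAL_ALIASES = {
    "4th avenue": "park avenue south",  # this stretch was renamed in 1959
}

# Gazetteer keyed by a normalized, order-independent street pair
# (hypothetical coordinates).
INTERSECTIONS = {
    ("broadway", "park avenue south"): (40.7399, -73.9893),
}

def normalize(street: str) -> str:
    """Lowercase, expand common abbreviations, and apply historical aliases."""
    s = street.lower().strip().replace("ave.", "avenue").replace("st.", "street")
    return HISTORICAL_ALIASES.get(s, s)

def geocode_intersection(street1: str, street2: str):
    """Return (lat, lon) for a historical intersection, or None if unknown."""
    key = tuple(sorted((normalize(street1), normalize(street2))))
    return INTERSECTIONS.get(key)
```

Sorting the street pair makes the lookup order-independent, so "Broadway and 4th Ave." and "4th Ave. and Broadway" resolve to the same entry.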

The switch wasn't solely about better historical data. In late 2024, Google's Maps API pricing model changed in a way that would have cost OldNYC around $35 a month, a significant burden for a volunteer-driven archive. Moving to OpenStreetMap vector tiles and MapLibre eliminated those costs while also delivering faster rendering (crucial for large datasets), smoother zooming for detailed historical exploration, and full control over map styling, letting the team strip away anachronistic modern features for a clearer, more authentic view of the past.

Alt text: A historical map of New York City showing various landmarks, with digital pins highlighting newly geolocated historical photo sites.

AI for Optical Character Recognition

Beyond geolocation, AI also refined how OldNYC processes text *within* images. In 2015, the project used a custom OCR (Optical Character Recognition) system, Ocropus, which was highly effective, achieving over 99% character accuracy.

However, in 2024, the team rebuilt the OCR system with gpt-4o-mini. This upgrade expanded text coverage from 25,000 to 32,000 images. When comparing both systems on identical images, the GPT-based approach outperformed the old one about 75% of the time, performing worse in only 2% of cases. It particularly excelled with high-resolution source images.
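An OCR request to a vision model bundles the image itself into the chat message as a base64 data URL. A sketch of building such a request (the prompt wording and helper are assumptions; the project's actual code is not shown here):

```python
import base64

def build_ocr_request(image_bytes: bytes, model: str = "gpt-4o-mini") -> dict:
    """Build a vision OCR request containing only the image, with no title
    or other context (see below: extra context triggered hallucinations)."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text visible in this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

# The request would then be sent with the OpenAI SDK, e.g.:
#   client.chat.completions.create(**build_ocr_request(img_bytes))
```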

Interestingly, they discovered that feeding GPT additional context, like image titles, actually caused text hallucinations. The most accurate results came from providing the model *only* the image, which points to a useful insight for LLM-based OCR: sometimes, limiting context improves accuracy. Meanwhile, the original 2015 text-detection code still plays a role, cropping images before the GPT-based OCR takes over, a reminder that specialized older tools can complement newer AI.
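The hand-off between detector and model is simple geometry: the 2015 code finds the text region, and the crop sent to GPT is that box plus a little padding, clamped to the image bounds. A sketch of that step (the helper and its padding value are illustrative, not OldNYC's actual code):

```python
def pad_and_clamp(box, image_size, pad=10):
    """Expand a detected text box by `pad` pixels on each side and clamp it
    to the image bounds, producing the crop handed to the GPT-based OCR.
    `box` is (left, top, right, bottom); `image_size` is (width, height)."""
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )

# With Pillow, the crop itself would be:
#   image.crop(pad_and_clamp(detected_box, image.size))
```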

AI's Role in Historical Preservation

OldNYC stands as a powerful, practical example of AI's potential in historical preservation. Crucially, this project demonstrates AI's role not in generating or 'fixing' historical artifacts, but as a sophisticated tool for organizing, locating, and enhancing the accessibility of existing history.

This approach directly counters concerns about AI altering historical truth. By focusing on extracting verifiable information from existing data—whether text descriptions for geolocation or text within images for OCR—OldNYC shows how AI can be a force for enhancing accuracy and expanding access, rather than fabricating history. For managers of large archives or datasets, OldNYC offers a model for leveraging AI to intelligently organize and enhance their collections, particularly by focusing on non-generative tasks.

The key takeaway is to identify specific, non-generative tasks where AI can augment human effort, and to remain meticulous about data input. Ultimately, OldNYC illustrates that the future of historical archives lies not in AI replacing human historians, but in providing them with more powerful tools to conduct their invaluable research and preservation efforts.

Priya Sharma
A former university CS lecturer turned tech writer. Breaks down complex technologies into clear, practical explanations. Believes the best tech writing teaches, not preaches.