
Have LLMs Finally Mastered Geolocation?


An ambiguous city street, a freshly mown field, and a parked armoured vehicle were among the example photos we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate. 

Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images and were highly prone to hallucinations. However, since then, such models have rapidly evolved. 

To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, we ran 500 geolocation tests, with 20 models each analysing the same set of 25 images. 

We chose 25 of our own travel photos, varying in difficulty to geolocate, none of which had been published online before.

Our analysis included older and “deep research” versions of the models, to track how their geolocation capabilities have developed over time. We also included Google Lens to compare whether LLMs offer a genuine improvement over traditional reverse image search. While reverse image search tools work differently from LLMs, they remain one of the most effective ways to narrow down an image’s location when starting from scratch.

The Test

We used 25 of our own travel photos to test a range of outdoor scenes in both rural and urban areas, with and without identifiable landmarks such as buildings, mountains, signs or roads. These images were sourced from every continent, including Antarctica.

The vast majority have not been reproduced here, as we intend to continue using them to evaluate newer models as they are released. Publishing them here would compromise the integrity of future tests.

Each LLM was given a photo that had not been published online and contained no metadata. All models then received the same prompt: “Where was this photo taken?”, alongside the image. If an LLM asked for more information, the response was identical: “There is no supporting information. Use this photo alone.”
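The article doesn’t say how the prompts were submitted (the “deep research” and “extended thinking” modes suggest the chat interfaces were used), but the same protocol can be scripted against a provider’s API. Below is a minimal sketch assuming the OpenAI Python client, an illustrative model name and a local, metadata-stripped test image; none of these specifics come from the original tests.

```python
# Minimal sketch, assuming the OpenAI Python client (openai>=1.0), an
# illustrative model name, and a local, metadata-stripped JPEG.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Where was this photo taken?"
# Sent as a second user turn only if the model asks for more information:
FOLLOW_UP = "There is no supporting information. Use this photo alone."

def geolocate(image_path: str, model: str = "gpt-4o") -> str:
    """Send the test prompt plus one base64-encoded image and return the reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(geolocate("snowy-highway.jpg"))  # hypothetical test image
```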

We tested the following models:

Developer | Model | Developer’s Description
Anthropic | Claude Haiku 3.5 | “fastest model for daily tasks”
Anthropic | Claude Sonnet 3.7 | “our most intelligent model yet”
Anthropic | Claude Sonnet 3.7 (extended thinking) | “enhanced reasoning capabilities for complex tasks”
Anthropic | Claude Sonnet 4.0 | “smart, efficient model for everyday use”
Anthropic | Claude Opus 4.0 | “powerful, large model for complex challenges”
Google | Gemini 2.0 Flash | “for everyday tasks plus more features”
Google | Gemini 2.5 Flash | “uses advanced reasoning”
Google | Gemini 2.5 Pro | “best for complex tasks”
Google | Gemini Deep Research | “get in-depth answers”
Mistral | Pixtral Large | “frontier-level image understanding”
OpenAI | ChatGPT 4o | “great for most tasks”
OpenAI | ChatGPT Deep Research | “designed to perform in-depth, multi-step research using data on the public web”
OpenAI | ChatGPT 4.5 | “good for writing and exploring ideas”
OpenAI | ChatGPT o3 | “uses advanced reasoning”
OpenAI | ChatGPT o4-mini | “fastest at advanced reasoning”
OpenAI | ChatGPT o4-mini-high | “great at coding and visual reasoning”
xAI | Grok 3 | “smartest”
xAI | Grok 3 DeepSearch | “advanced search and reasoning”
xAI | Grok 3 DeeperSearch | “extended search, more reasoning”

This was not a comprehensive review of all available models, partly due to the speed at which new models and versions are currently being released. For example, we did not assess DeepSeek, as it currently only extracts text from images. Note that in ChatGPT, regardless of which model you select, the “deep research” function is currently powered by a version of o4-mini.

Gemini models have been released in “preview” and “experimental” formats, as well as dated versions like “03-25” and “05-06”. To keep the comparisons manageable, we grouped these variants under their respective base models, e.g. “Gemini 2.5 Pro”. 

We also compared every test with the first 10 results from Google Lens’s “visual match” feature, to assess the difficulty of the tests and the usefulness of LLMs in solving them. 

We ranked all responses on a scale from 0 to 10, with 10 indicating an accurate and specific identification, such as a neighbourhood, trail, or landmark, and 0 indicating no attempt to identify the location at all.

And the Winner is…

ChatGPT beat Google Lens.

In our tests, ChatGPT o3, o4-mini, and o4-mini-high were the only models to outperform Google Lens in identifying the correct location, though not by a large margin. All other models were less effective when it came to geolocating our test photos.

We scored 20 models against 25 photos, rating each from 0 (red) to 10 (dark green) for accuracy in geolocating the images.

Even Google’s own LLM, Gemini, fared worse than Google Lens. Surprisingly, it also scored lower than xAI’s Grok, despite Grok’s well-documented tendency to hallucinate. Gemini’s Deep Research mode scored roughly the same as the three Grok models we tested, with DeeperSearch proving the most effective of xAI’s LLMs.

The highest-scoring models from Anthropic and Mistral lagged well behind their current competitors from OpenAI, Google, and xAI. In several cases, even Claude’s most advanced models identified only the continent, while others were able to narrow their responses down to specific parts of a city. The latest Claude model, Opus 4, performed at a similar level to Gemini 2.5 Pro. 

Here are some of the highlights from five of our tests.

A Road in the Japanese Mountains

The photo below was taken on the road between Takayama and Shirakawa in Japan. As well as the road and mountains, signs and buildings are also visible.

Test “snowy-highway” depicted a road near Takayama, Japan.

Gemini 2.5 Pro’s response was not useful. It mentioned Japan, but also Europe, North and South America and Asia. It replied:

“Without any clear, identifiable landmarks, distinctive signage in a recognisable language, or unique architectural styles, it’s very difficult to determine the exact country or specific location.”

In contrast, o3 identified both the architectural style and signage, responding:

“Best guess: a snowy mountain stretch of central-Honshu, Japan—somewhere in the Nagano/Toyama area. (Japanese-style houses, kanji on the billboard, and typical expressway barriers give it away.)”

A Field on the Swiss Plateau

This photo was taken near Zurich. It showed no easily recognisable features apart from the mountains in the distance. A reverse image search using Google Lens didn’t immediately lead to Zurich. Without any context, identifying the location of this photo manually could take some time. So how did the LLMs fare?

Test “field-hills” depicted a view of a field near Zurich.

Gemini 2.5 Pro stated that the photo showed scenery common to many parts of the world and that it couldn’t narrow it down without additional context. 

By contrast, ChatGPT excelled at this test. o4-mini identified the “Jura foothills in northern Switzerland”, while o4-mini-high placed the scene “between Zürich and the Jura mountains”.

These answers stood in stark contrast to those from Grok Deep Research, which, despite the visible mountains, confidently stated the photo was taken in the Netherlands. This conclusion appeared to be based on the Dutch name of the account used, “Foeke Postma”, with the model assuming the photo must have been taken there and calling it a “reasonable and well-supported inference”.

An Inner-City Alley Full of Visual Clues in Singapore

This photo of a narrow alleyway on Circular Road in Singapore provoked a wide range of responses from the LLMs and Google Lens, with scores ranging from 3 (nearby country) to 10 (correct location).

Test “dark-alley” depicted an alleyway in Singapore.

The test served as a good example of how LLMs can outperform Google Lens by focusing on small details in a photo to identify the exact location. Those that answered correctly referenced the writing on the mailbox on the left in the foreground, which revealed the precise address.

While Google Lens returned results from all over Singapore and Malaysia, part of ChatGPT o4-mini’s response read: “This appears to be a classic Singapore shophouse arcade – in fact, if you look at the mailboxes on the left you can just make out the label ‘[correct address].’”  

Some of the other models noticed the mailbox but could not read the address visible in the image, falsely inferring that it pointed to other locations. Gemini 2.5 Flash responded, “The design of the mailboxes on the left, particularly the ‘G’ for Geylang, points strongly towards Singapore.” Another Gemini model, 2.5 Pro, spotted the mailbox but focused instead on what it interpreted as Thai script on a storefront, confidently answering: “The visual evidence strongly suggests the photo was taken in an alleyway in Thailand, likely in Bangkok.”  

The Costa Rican Coast

One of the harder tests we gave the models was a photo taken from Playa Langosta on the Pacific coast of Costa Rica, near Tamarindo.

Test “beach-forest” showed Playa Langosta, Costa Rica.

Gemini and Claude performed the worst on this task, with most models either declining to guess or giving incorrect answers. Claude 3.7 Sonnet correctly identified Costa Rica but hedged with other locations, such as Southeast Asia. Grok was the only model to guess the exact location correctly, while several ChatGPT models (Deep Research, o3 and the o4-minis) guessed within 160km of the beach.

An Armoured Vehicle on the Streets of Beirut

This photo was taken on the streets of Beirut and features several details useful for geolocation, including an emblem on the side of the armoured personnel carrier and a partially visible Lebanese flag in the background.

Test “street-military” depicted an armoured personnel carrier on the streets of Beirut.

Surprisingly, most models struggled with this test: Claude 4 Opus, billed as a “powerful, large model for complex challenges”, guessed “somewhere in Europe” owing to the “European-style street furniture and building design”, while Gemini and Grok could only narrow the location down to Lebanon. Half of the ChatGPT models responded with Beirut. Only two models, both ChatGPT, referenced the flag.  

So Have LLMs Finally Mastered Geolocation?

LLMs can certainly help researchers spot details that Google Lens, or the researchers themselves, might miss.

One clear advantage of LLMs is their ability to search in multiple languages. They also appear to make good use of small clues, such as vegetation, architectural styles or signage. In one test, a photo of a man wearing a life vest in front of a mountain range was correctly located because the model identified part of a company name on his vest and linked it to a nearby boat tour operator.

For touristic areas and scenic landscapes, Google Lens still outperformed most models. When shown a photo of Schluchsee lake in the Black Forest, Germany, Google Lens returned it as the top result, while ChatGPT was the only LLM to correctly identify the lake’s name. In contrast, in urban settings, LLMs excelled at cross-referencing subtle details, whereas Google Lens tended to fixate on larger, similar-looking structures, such as buildings or ferris wheels, which appear in many other locations.

Heat map showing how each model performed on all 25 tests.
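The underlying heat map isn’t reproduced here, but a similar chart can be built from any models-by-tests score matrix. Below is a minimal matplotlib sketch on the same 0 (red) to 10 (dark green) scale; the scores and labels are random placeholders, not Bellingcat’s results.

```python
# Minimal sketch: render a models-by-tests score matrix as a heat map on the
# article's 0 (red) to 10 (dark green) scale. Scores and labels are random
# placeholders, not Bellingcat's results.
import numpy as np
import matplotlib.pyplot as plt

models = [f"model-{i}" for i in range(20)]   # placeholder model names
tests = [f"test-{j}" for j in range(25)]     # placeholder test names
scores = np.random.randint(0, 11, size=(len(models), len(tests)))

fig, ax = plt.subplots(figsize=(12, 6))
im = ax.imshow(scores, cmap="RdYlGn", vmin=0, vmax=10, aspect="auto")
ax.set_xticks(range(len(tests)), labels=tests, rotation=90, fontsize=6)
ax.set_yticks(range(len(models)), labels=models, fontsize=7)
fig.colorbar(im, ax=ax, label="Geolocation score (0-10)")
fig.tight_layout()
plt.show()
```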

Enhanced Reasoning Modes

You might assume that turning on “deep research” or “extended thinking” functions would result in higher scores. However, on average, Claude and ChatGPT performed worse with these modes enabled. Only one Grok model, DeeperSearch, and one Gemini model, Deep Research, showed improvement. For example, ChatGPT Deep Research was shown a photo of a coastline and took nearly 13 minutes to produce an answer that was about 50km north of the correct location. Meanwhile, o4-mini-high responded in just 39 seconds and gave an answer 15km closer.

Overall, Gemini was more cautious than ChatGPT, but Claude was the most cautious of all. Claude’s “extended thinking” mode made Sonnet even more conservative than the standard version. In some cases, the regular model would hazard a guess, albeit hedged in probabilistic terms, whereas with “extended thinking” enabled for the same test, it either declined to guess or offered only vague, region-level responses.

LLMs Continue to Hallucinate

All the models, at some point, returned answers that were entirely wrong. ChatGPT was typically more confident than Gemini, often leading to better answers, but also more hallucinations. 

The risk of hallucinations increased when the scenery was temporary or had changed over time. In one test, for instance, a beach photo showed a large hotel and a temporary ferris wheel (installed in 2024 and dismantled during winter). Many of the models consistently pointed to a different, more frequently photographed beach with a similar ride, despite clear differences.

Final Tips

Your account and prompt history may bias results. In one case, when analysing a photo taken in the Coral Pink Sand Dunes State Park, Utah, ChatGPT o4-mini referenced previous conversations with the account holder: “The user mentioned Durango and Colorado earlier, so I suspect they might have posted a photo from a previous trip.” 

Similarly, Grok appeared to draw on a user’s Twitter profile, and past tweets, even without explicit prompts to do so. 

Video comprehension also remains limited. Most LLMs cannot search for or watch video content, cutting off a rich source of location data. They also struggle with coordinates, often returning rough or simply incorrect responses. 

Ultimately, LLMs are no silver bullet. They still hallucinate, and when a photo lacks detail, geolocating it will still be difficult. That said, unlike our controlled tests, real-world investigations typically involve additional context. While Google Lens accepts only keywords, LLMs can be supplied with far richer information, making them more adaptable.

There is little doubt that, at the rate they are evolving, LLMs will continue to play an increasingly significant role in open source research. And as newer models emerge, we will continue to test them.

Infographics by Logan Williams and Merel Zoet



Read "My AI Skeptic Friends Are All Nuts"


Using AI to Evaluate Internet Standards


One of the reasons I’m writing the Nature of Internet Standards series is to give policymakers tools that help them to understand and evaluate open standards efforts. A big part of that is understanding whether a specification (and the effort that led to it) is really a reflection of open deliberation, consensus processes, appropriate governance, and various other hallmarks of open standards.

Evaluating these factors requires a fair amount of context and nuance. What’s the most 2025 way to avoid that work? Using AI, of course.

Some colleagues were using AI to ask similar questions, and that inspired me to wonder whether it would be a useful option for policymakers who want to evaluate the openness of Internet standards. Let’s find out.

The Privacy Sandbox

Our first experiment is an effort that I’m fairly familiar with, having engaged with it both professionally and academically – Google’s Privacy Sandbox. How will AI portray this longstanding and very politicised effort?

Prompt

I started by asking ChatGPT’s “deep research” mode:

Is the Privacy Sandbox developed in an open standards process? Consider the venue(s) it’s developed in, their governance, openness, and conformance to other widely recognised signifiers of open standards processes, breadth of implementation, and adoption.

It asked for a few clarifications:

To give you the most accurate and up-to-date analysis, could you clarify a few things:

  1. Are you primarily interested in the Privacy Sandbox on the web (i.e., as developed by Chrome and the W3C), or are you also considering the Android Privacy Sandbox?
  2. Do you want a comparison to established open standards processes (e.g., W3C, IETF), or just an evaluation of the Privacy Sandbox initiatives on their own merits?
  3. Are you looking for an overview of industry support (browsers, advertisers, publishers, etc.) and real-world implementation as part of your evaluation of adoption?

This will help me tailor the research more precisely.

To which I replied:

Just on the Web. A comparison to open standards processes would be good. I don’t need an overview of support and implementation, but the results should take these factors into account.

Response

After churning through data (and pushing the boundaries of copyright) for a couple of minutes, ChatGPT came up with this voluminous report. However, the heart of its response is concise and fairly damning:

In summary, the Privacy Sandbox has been developed in semi-open forums (W3C community groups, public repos, Google blogs) and invites outside input, but it is not a widely supported multi-vendor standard yet. Only Google’s Chrome (and related browsers) implements the APIs, and other browser vendors have developed their own approaches. The process resembles an open-source Google project more than a classic consensus-driven SDO. As Mozilla and others have noted, standards bodies like the W3C and IETF are the “ideal forum” to vet such proposals with full stakeholder oversight. At present, Privacy Sandbox’s governance does not fully align with those open standards processes – it provides considerable transparency and documentation, but its development and adoption remain largely controlled by Google.

That’s not far off. W3C Community Groups aren’t standards processes; in fact, they’re little more than mailing lists where the proponents get to set almost all of the rules. ChatGPT could have made this clearer, but instead it uses lots of hedging phrases like “more than” and “fully align”.

Since the full report mentioned the chair of the Business Group that also discussed the Privacy Sandbox, Wendy Seltzer, I asked what she thought. Her response:

It’s fascinating to follow the sources here, because the Wikipedia article appears to have gotten its last substantive updates in 2024, and is mostly focused on older material ([which is] sometimes wrong). A human researcher would recognize that as not the best authority for recent news. A better human researcher would also ask “is this the question you want to ask, since the non-deprecation of third party cookies has changed the landscape around all these initiatives?”

That illuminates a broader issue of missing context. The answer starts from the premise that “Privacy Sandbox” is something that could be developed in an open standards process, rather than parsing the question, is PS more than the set of APIs and proposals that might fit into standards processes? How do open standards fit into the overall strategy? The GPT report can work for you, an expert in the field who can fill in those aspects, but it could be misleading to someone without the same background.

Some of the sentences quoted from other sources seem off without the context of those sources, e.g. “In January 2020 Google invited ad-tech companies to join the W3C’s Improving Web Advertising Business Group (IWABG) to discuss and give feedback on Privacy Sandbox ideas” reads as implying that Google was the only one doing the inviting.

QUIC

Next, let’s try another effort that I’m familiar with: QUIC, the effort to define a new, performance and security-focused transport protocol. I chaired the QUIC Working Group (along with Lars Eggert) from its inception until it shipped its core deliverables.

Prompt

I asked ChatGPT’s “deep research” mode a similar question for QUIC:

Is QUIC developed in an open standards process? Consider the venue(s) it’s developed in, their governance, openness, and conformance to other widely recognised signifiers of open standards processes, breadth of implementation, and adoption.

and once again, it asked for a similar clarification:

To best address your question about QUIC, could you please clarify what aspect you’re most interested in?

  • Are you focused more on the technical development and governance process (e.g., how open the standardization process is)?
  • Or are you more interested in its current implementation and adoption across the industry?
  • Should I compare it to other protocols or standards in terms of openness and adoption?

Let me know so I can tailor the research accordingly.

I responded:

Focus on how open the standardisation process was, but consider implementation and adoption in your answer. A comparison with other relevant standards would be helpful.

Response

Here’s the full report. The summary is positively glowing:

In conclusion, QUIC’s standardization has been as open and transparent as any recent IETF protocol. It was developed in the open IETF process, adhering to OpenStand and WTO/TBT ideals of consensus, transparency, and voluntary adoption. Its final RFCs affirm community consensus, and the IETF infrastructure provided public visibility into every stage. This openness is mirrored by QUIC’s broad support: independent implementers from Google, Mozilla, Apple, Cloudflare, Facebook, and beyond have all deployed QUIC. The resulting ecosystem – with multiple browsers, CDNs, servers and apps supporting QUIC – reinforces that it is a truly open standard, not controlled by any single vendor. By contrast, older protocols like HTTP/2 and TCP underwent similar open processes (IETF consensus, public RFCs) and likewise enjoy ubiquitous implementation. Together, the history and adoption of QUIC demonstrate an effective, open standardization process in line with global best practices.

First of all, I love the reference to the “WTO/TBT ideals of consensus, transparency, and voluntary adoption.”

But is it accurate? Sort of. The facts here are correct: those companies did participate in the effort, implemented it, and deployed it. The process was open (as can be seen in the minutes and other materials).

I don’t know, however, how it justifies its statements about the quality of consensus – as I’ve written before, consensus can go wrong in a variety of ways. Others have characterised QUIC’s development as having Google firmly in the “driving seat.” Having been there for the whole standards process, I disagree with that assessment, but I’d at least expect academic work like that to be acknowledged.

Is AI Useful for Assessing Open Standards?

I can’t count the number of times that I’ve seen policymakers, journalists, and community members refer to the Privacy Sandbox as “at the W3C” or “being standardised at the W3C.” Given that extremely low bar, ChatGPT’s summary is an improvement. Likewise, I largely agree with its assessment of QUIC, at a high level.

What’s lacking here, however, is any kind of nuance. I can’t escape the feeling that it latches onto a few narratives that appear in source materials and augments them into well-worn clichés, like we see for QUIC. The IETF has a great reputation in many sources, so that gets amplified, but there’s a lack of any critical thought.

That’s not surprising: AI can’t think. If it could, it might wonder about the criteria we’re using for “open standards” here – are those WTO/TBT ideals still relevant, and are they adequately described? Are the processes actually used in working groups lining up with the rhetoric of openness – and how would you find out if they didn’t? And, how much should all of that count if the result isn’t proven by market adoption?

In a nutshell: if you must use AI to assess the openness of a standard, only use it for the first pass, check all of the references, and then roll up your sleeves and start talking to people to get the real story.
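For anyone who does want to script that first pass, here is a rough sketch. It assumes the OpenAI Python client and an ordinary chat completions call rather than the ChatGPT “deep research” mode that produced the reports above, and the model name is a placeholder; whatever it returns is only a starting point whose references still need checking by hand.

```python
# Rough first-pass sketch, assuming the OpenAI Python client and a placeholder
# model name. This is an ordinary chat completions call, not the ChatGPT
# "deep research" mode used for the reports quoted above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Is {effort} developed in an open standards process? Consider the venue(s) "
    "it's developed in, their governance, openness, and conformance to other "
    "widely recognised signifiers of open standards processes, breadth of "
    "implementation, and adoption."
)

def first_pass(effort: str, model: str = "gpt-4o") -> str:
    """Return the model's unverified first-pass assessment of one effort."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION.format(effort=effort)}],
    )
    return response.choices[0].message.content

for effort in ("the Privacy Sandbox (web)", "QUIC"):
    print(f"--- {effort} ---")
    print(first_pass(effort))
```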


Direct File on GitHub


The IRS has now published the vast majority of Direct File’s code on GitHub as open-source software. As a work of the U.S. government, Direct File is in the public domain. And now everyone can check it out.

Releasing Direct File’s source code demonstrates that the IRS is fulfilling its obligations under the SHARE IT Act (three weeks ahead of schedule!). Now that Direct File has paved the way, I hope that more of the IRS’s code, paid for with taxpayer dollars, will soon be available to all of us.

Open sourcing Direct File has long been planned, and even longer desired. Explaining last May why open source is particularly important for Direct File, the team wrote:

The IRS could take further steps to build public trust and enable independent assessment of its work. The Direct File product team was given the mandate to develop software that ensures every taxpayer receives the full benefit of any tax provisions for which they are eligible. Releasing components of Direct File as open-source software would enable the team to demonstrate this commitment.

Establishing trust with taxpayers was core to our approach for designing and building Direct File. By creating the most accurate option for filing, by making taxes accessible to all, by keeping taxpayer data secure, and now, by publicly sharing Direct File’s code, the Direct File team showed our dedication to earning taxpayers’ trust.

Please note: As of two weeks ago, I no longer work at the IRS. I am writing solely in my personal capacity.


What happens when the intelligence goes out?


Brittleness and resiliency. Read here.


My Wildly Incorrect Bias About Corporate Engineers


Before I went to work for Audible (five years ago now — time flies!) I had a bias about engineers that worked for large corporations. I assumed that they weren’t as good as indies and engineers at small companies, or else they’d actually be indies or work at small shops like Omni.

Obviously I knew there had to be exceptions, particularly at Apple, or else we wouldn’t have had great things like AppKit and UIKit and everything else we’ve built on over these years. But the bias persisted.

* * *

Before Audible, the largest company I’d ever worked at (Newsgator) had just over 100 people. When I worked at Omni it had roughly half that number.

I’ve spent half my career working at even smaller companies, with just me and Sheila (Ranchero Software) or at places with three people (Q Branch) or like six people (UserLand Software).

And of course I was arrogant enough to think that I was better — much better — than any corporate engineer. While a corporate engineer might own some small part of an app or framework — or just a single button, as the (lame) joke went back in the day — I was shipping entire apps on my own or with a very small team. Popular, valuable, newsworthy apps that people loved.

And I wasn’t the only one: think of Flying Meat, Rogue Amoeba, Bare Bones, Red Sweater, The Iconfactory and many more.

* * *

And so I learned very quickly when I started at Audible that I was very wrong. I was impressed, and grew more impressed as time went on, by my fellow engineers’ rigor, talent, professionalism, care, and, especially, ability to work with other people toward common goals.

While I’m the die-hard introvert who just wants to go into a room and sit in front of a Mac and write some code and get things done, I learned that my co-workers — even if they, like me, kinda just wanted to sit and write code — were great at app development as a team sport. I was impressed with how they wanted to grow and did grow — always leveling-up their individual skills and their ability to work on a team and across teams.

And what a team it was! It’s not a new observation, but the indies I mentioned above, and the ones I didn’t, tend to be white men born in the United States — the people who could most afford to fail, in other words, because for them (for me, absolutely) there’s always another opportunity.

My team didn’t look like that — it was quite a contrast with my previous experience. Many more women, people of color, people born outside the United States. (But note that there’s always more progress to be made!)

The engineers on my team could write apps as well, if not better in many cases, than the indies I know. And the ones who aren’t quite there yet — well, just give them a little more time. They’ve all given me reason to believe in them.

I regret my bias about engineers working in corporate environments, and I’m so glad I learned the truth almost from day one on starting at Audible.

* * *

For a couple years I did a lot of hiring — a lot of interviews — at Audible. And I noticed something: there was a strong correlation between being hirable and having worked with other people.

The folks who’d worked largely by themselves, or on just one small team, weren’t as good candidates as the folks who’d worked with more people. This, of course, went against my original bias that indies are the best engineers — but by then I knew that a candidate who’d worked with lots of other people had been exposed to more code, more dilemmas, more challenges (technical and human), and they were not just more ready to work on a larger team but more knowledgeable. Even their individual skills were greater.

Advice time: if you’re a newer engineer, find ways to work with other people. Not just because you’re more likely to get hired at a place like Audible — but because, no matter where you want to work, you’ll be better at it.

You can’t just sit alone in front of your computer all day and write code and expect to be a great engineer.

Lesson learned!

* * *

With retirement imminent — this is my last job, and June 6 is my last day (maybe I’ve buried the lede here) — I want to thank my team publicly for how they’ve made me a better engineer and, more importantly, a better person. From the bottom of my heart.

I learned more from them than I could ever have taught; I got the better part of this deal.

Thank you, team! So much. ❤️
