In this talk, I will present a systematic evaluation of twelve multimodal large language models (MLLMs) in understanding statistical maps. Statistical maps have traditionally served as sources of information acquired, analyzed, and interpreted by humans. The recent development of MLLMs that handle both visual and textual input opens new possibilities for machines to process spatial information.
Building on a classic cartographic theoretical framework, I tested the models across three dimensions of cartographic thought: map reading (recognizing symbols and extracting values), map analysis (identifying spatial patterns), and map interpretation (contextual reasoning). I also examined how performance varies with map properties, including graphical complexity, spatial unit aggregation, map source, and visualization technique.
The study reveals a capability hierarchy that inverts the human progression from reading to interpretation: models performed best on interpretation tasks (mean = 3.69/5), weaker on analytical pattern discernment (3.03/5), and worst on accurate value extraction during map reading (2.83/5). MLLMs compensate for their visual processing limitations with extensive pre-trained geographic knowledge, representing a fundamentally different cognitive pathway from human map reading. Overall, commercial models outperformed free alternatives, and country-level maps with a single visualization technique yielded better results than complex regional maps.