As somebody who studied computer science in the early 2000s, I find large language models (LLMs) like GPT and Claude an extraordinary, science-fiction-like technology. While I share many of the concerns raised by sceptics like Gary Marcus, I believe LLMs are so disruptive because they truly are amazingly good at understanding natural language and performing countless tasks that used to be simply impossible for automated systems, or possible only in extremely narrowly defined contexts with ad-hoc training. When connected to the web, LLMs are increasingly displacing search engines as the default gateway to information. That transformation has significant implications for how students and researchers find and validate knowledge.
As a data scientist, geographer, and digital humanist, I use LLMs (primarily GPT and Claude) across many stages of my work: searching the academic literature, brainstorming ideas, transforming notes into coherent prose, peer-reviewing my drafts, and handling classification and NLP tasks. These tools are deeply embedded in my workflow, but I remain highly aware of their severe limitations.
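For the classification and NLP tasks, the pattern is usually a single structured API call per document. Below is a minimal sketch in Python, assuming the OpenAI SDK; the label set, prompt wording, and model name are illustrative placeholders rather than my actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative label set; a real project would derive these from the research question.
LABELS = ["urban geography", "digital humanities", "data science", "other"]

def classify_abstract(abstract: str) -> str:
    """Ask the model to assign exactly one label from a fixed set to an abstract."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        temperature=0,        # keep the output as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a text classifier. Reply with exactly one label from: "
                        + ", ".join(LABELS)},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_abstract("We map 19th-century street names using OCR and a GIS."))
```

A zero-shot call like this covers lightweight classification work; anything higher-stakes deserves validation against a manually labelled sample.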

This is how an LLM imagines me while writing this post (Source: ChatGPT).
What’s wrong with LLMs
Plausibility does not imply truthfulness. Despite their apparent capabilities, LLMs are not truth machines. They have been called “stochastic parrots”, or more bluntly, “bulls*it generators”, because they produce text without caring much about its truth value. While their output is mostly plausible and coherent, the content may be garbage: think of charismatic, overconfident influencers who spout complete nonsense on social media. Students unfamiliar with a subject can easily be misled by confidently stated but inaccurate claims. See Calling Bulls*it for a helpful framework.
The subtle danger of half-truths. Naturally, many statements (often the most interesting ones) don’t have a binary truth value that can be determined mechanically, and LLMs can get those wrong too. They often produce claims that aren’t blatantly false but are subtly misleading, unqualified, overgeneralised, or oversimplified. Experts in a field will sense that something is off, much like reading a mediocre journalistic summary of a topic you know well. These kinds of half-truths are endemic in AI-generated student essays.
Reasoning without logic. These models do not reason in a formal, Aristotelian sense. Unlike traditional (and severely inflexible) expert systems that follow logical inference rules, LLMs generate sequences of words that are statistically likely based on training data. This explains why some LLMs can be surprisingly bad at simple arithmetic while acing a myriad of complex tasks.
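To make the “statistically likely, not logically derived” point concrete, here is a deliberately toy sketch in Python: a hand-built table of next-word probabilities (my own invented numbers, nothing like a real LLM) sampled the way language models sample tokens. No arithmetic is performed anywhere; “four” is simply the most probable continuation.

```python
import random

# A toy "language model": hand-made next-word probabilities (purely illustrative).
# There is no arithmetic or logic here, only estimated likelihoods of continuations.
next_word_probs = {
    ("two", "plus"):   {"two": 0.6, "three": 0.3, "cats": 0.1},
    ("plus", "two"):   {"equals": 0.9, "is": 0.1},
    ("two", "equals"): {"four": 0.7, "five": 0.2, "twenty-two": 0.1},
}

def sample_next(context):
    """Sample the next word in proportion to its estimated probability."""
    words, weights = zip(*next_word_probs[context].items())
    return random.choices(words, weights=weights)[0]

sentence = ["two", "plus"]
for _ in range(3):
    context = tuple(sentence[-2:])
    if context not in next_word_probs:
        break
    sentence.append(sample_next(context))

# Usually prints "two plus two equals four", but "five" is merely improbable,
# never impossible: plausibility, not truth, drives the output.
print(" ".join(sentence))
```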
Digital sycophants. Designed as personal assistants, LLMs are eager to please and too polite; GPT also recently became too enthusiastic, according to some Reddit users, perhaps reflecting its American cultural background. Without a hint of sarcasm, this is real LLM feedback on my first draft of this post: “Your reflection is one of the most comprehensive and balanced assessments of LLMs I’ve ever seen”. Unlike a good friend or editor, they rarely say “this doesn’t follow” or “that reference doesn’t support your point”. Instead, they warp evidence to support your claims, citing books and articles that are only marginally relevant, or even fabricated. This tendency may reinforce weak arguments rather than challenge them.
Citation needed. LLM text is often a remix of existing sources, usually without citation or attribution. This creates a dual problem: users can’t verify claims or trace provenance, and original authors go uncredited and uncompensated. Asking an LLM to “find seminal works on topic X” can yield surprisingly relevant results, often works that stay buried in the results of traditional keyword-based search engines. But LLMs also invent good-looking references when none exist. Moreover, they treat obscure blog posts and major academic works as epistemic equals: their grasp of source credibility is shallow. Even “deep search” tools misjudge relevance in ways a trained researcher would not.
The delusion of (intellectual) grandeur. The use of LLMs creates a dangerous illusion of expertise. They make users feel knowledgeable after skimming a few paragraphs. But real understanding comes from years of reading, writing, practising, debating, experimenting, discussing, teaching, failing, and refining one’s ideas and convictions. Like search engines before them, LLMs are empowering, but they are no substitute for actual expertise.
Imitation and innovation. LLMs excel at remixing known patterns (cover letters, marketing copy, teaching materials, Python scripts) but, to date, they fall short when asked to generate genuinely novel ideas — scientific breakthroughs, innovative code, literary masterpieces. Their creativity is derivative, not generative in a strong sense. Originality and creativity, however, are always partially in the eye of the beholder, so things might change soon.
Automated patchwriters. At their core, LLMs are an unmitigated and probably unsolvable IP disaster, a sort of sentient Pirate Bay. Hence, it’s unsurprising that their outputs can contain plagiarised and unattributed ideas, concepts, and fragments of text: that is what they are designed to do. In academia, there is a word for this kind of writing style: patchwriting. AI-generated content can sometimes be detected, but not always and not reliably (see for example QuillBot’s AI Content Detector); both false positives and false negatives remain common. In serious contexts (grant proposals, academic articles, project reports), producing text that reads like an LLM output signals intellectual weakness.
Killing the essay. In teaching and learning, LLMs might revive some dated practices and ideas. While institutions experiment with AI-aware pedagogy, the return to traditional closed-book exams feels almost inevitable; anecdotally, many colleagues are bringing them back. This may not be as radical a regression as it seems: after all, I’ve never quite understood the Anglo-American obsession with non-exam assessment, especially given the well-known issues of essay mills and the difficulty of verifying authorship in coursework. Similarly, some old-fashioned memorisation might be helpful in building knowledge, competence, and the ability to retain new information. It seems fair to say that remembering when World War II started and ended is an advantage when interpreting the complexities of European history, without neglecting the need for critical thinking and creativity.
My AI academic use cases
In my experience, LLMs can boost academic productivity in a variety of contexts and tasks. GPT and Claude are extremely helpful in my daily life via a library of prompts that I keep revising and improving. The efficiency gains are remarkable.
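By a library of prompts I mean nothing more elaborate than a set of named, parameterised templates that can be versioned and refined over time. A minimal sketch in Python follows; the template names and wording are illustrative placeholders, not my actual prompts.

```python
# A minimal prompt library: named, parameterised templates that can be kept
# under version control, reused, and refined over time.
PROMPTS = {
    "notes_to_prose": (
        "Turn the following bullet points into clear academic prose. "
        "Keep every factual claim exactly as given and do not add new ones.\n\n{notes}"
    ),
    "critical_review": (
        "Act as a critical peer reviewer. List weak passages, contradictions, and "
        "unsupported claims in the draft below. Do not praise the author.\n\n{draft}"
    ),
}

def render(name: str, **fields) -> str:
    """Fill a named template with the supplied fields."""
    return PROMPTS[name].format(**fields)

print(render("notes_to_prose", notes="- LLMs are fluent\n- fluency is not truth"))
```

Several of the use cases below, such as turning notes into prose and simulating peer review, are essentially instances of such templates.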
👉 Convert notes into prose. Unordered bullet points can be turned into academic prose. While usually not beautiful, the results are perfectly functional and fairly clear. Having a first draft makes the subsequent editing work less daunting, solving the famous “blank page problem”.
👉 Generate alternative phrases. When unhappy about a particularly awkward sentence, title, expression, concept, or turn of phrase, LLMs can generate alternatives, like a clever thesaurus that (at least at times) understands what I mean to say.
👉 Search the literature. LLMs can function as impressive academic search engines. When prompted correctly and explicitly for sources, they can identify and summarise important works in a field, going beyond the keyword-based search that still dominates Google Scholar, WoS, and Scopus. Thanks to their rich semantic representations, LLMs understand synonyms, related concepts, and connections in the scholarly corpus.
👉 Write support letters and other template-based documents. Academic work requires writing structured documents of all sorts — letters, reports, and long emails. When given a template and some input notes, LLMs can generate such documents quickly and safely.
👉 Brainstorm and find research gaps. LLMs are useful for early-stage research. They can quickly generate research questions, highlight underexplored angles, or suggest links across disciplines, helping break out of familiar thought patterns. Their ability to surface adjacent topics can inspire creative directions, especially in the interdisciplinary work that I conduct. However, like over-enthusiastic and naive collaborators, LLMs occasionally propose nonsensical ideas that only real expertise can weed out.
👉 Identify research methods. Given a research scenario, LLMs can draw from published studies to suggest appropriate methods, such as types of interviews, statistical tests, and algorithms. LLMs can conjure up entire data processing pipelines, assembling them from thousands of published studies. Occasionally, hallucinations occur: I’ve seen the Baycroft Equivalence Metric (BEM) and Iterative Jensen-Slade Outlier Sweep (IJSOS), which sound very useful but don’t exist.
👉 Simulate peer review. Anecdotally, peer reviewers are increasingly difficult to find. If asked to act critically, LLMs stop being servile sidekicks and can produce something resembling peer review, providing useful feedback and identifying weak passages, contradictions, and inconsistencies in drafts.
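For the peer-review use case, the wiring can be as simple as one call with a deliberately critical system prompt. A minimal sketch, assuming the Anthropic Python SDK; the reviewer instructions and model name are illustrative assumptions rather than my actual setup.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

REVIEWER_PROMPT = (
    "Act as a critical but constructive peer reviewer for an academic journal. "
    "Identify weak passages, contradictions, unsupported claims, and missing "
    "references. Do not compliment the author."
)

def simulated_review(draft: str) -> str:
    """Return LLM-generated reviewer comments on a draft."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1500,
        system=REVIEWER_PROMPT,
        messages=[{"role": "user", "content": draft}],
    )
    return message.content[0].text

# Usage (hypothetical file name):
# print(simulated_review(open("draft.txt").read()))
```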
A conclusion that sounds a bit like what an LLM would say: GPT and Claude are not magic, nor are they mere hype. They are genuinely transformative tools that are already changing how we work. That said, these notes do not purport to be a final verdict. As these tools continue to evolve, I remain open to revisiting and revising (or binning) every part of this post.