When LLMs Told Me It Would Not Work
A showcase of several LLMs being factually right and fundamentally wrong
This article took much longer to write than any other article I have written so far for Substack. The matter is complex, but the reason is simple. How to faithfully report an experience of days and often weeks in a single article? How to do that in a way that captures the most crucial elements of the message I want to transmit? And perhaps more challengingly, how to accomplish all of this without resulting in a severely tedious piece?
I am not sure if I have the best answers for all three questions, but this article is my best attempt at it.
Context
When I wrote “I’m Sorry to Burst Your Bubble: You Are Being Fooled About AI, and You Will Soon Feel Really Stupid”, many people could not contemplate a scenario in which you would propose an idea to an LLM and the LLM would not just do it. The question I kept receiving was, “Can you give an example of when you brainstormed, and the LLM pushed back?”
It is perfectly understandable that for most people, LLMs are tools that give you what you ask for. You ask it to write a marketing email, and it writes one. You ask it to generate an image, and it does so. Then it is only natural that the idea that an LLM would refuse to cooperate with your creative direction seems absurd.
But when it comes to brainstorming novel technical solutions, we must remind ourselves that a novelty, by definition, does not exist in training data.
Innovation typically occurs when you walk into genuinely unexplored territory. We should consider the kind of idea that breaks existing paradigms, which is one great example where LLMs don't have patterns to match against. It actually has the opposite: mountains of data explaining why people tried similar things before and failed. So it tells you, eloquently and with perfect citations, that your idea won’t work.
And to make it crystal clear, that's not the same as hallucinating. The facts it presents are real. The papers it references exist. The engineering arguments are sound. But it is answering a different question than the one you are asking. You are asking, “Is this possible?” and it is answering, “Has this been done?”
Inspiration
I originally planned to showcase three of these brainstorming sessions. But compressing days and sometimes weeks of back-and-forth into a single article turned out to be harder than I expected. To capture the actual experience, the tension, the escalation, I need a minimum level of detail. The reasoning here is: one session done well beats three done superficially. So here is one, in as many details as the space (and your attention span) allows me.
I was inspired by a 2022 Netflix documentary about SpaceX named Return to Space. It tells a captivating story of how, for decades, aerospace engineers, scientists, brilliant people, not fools, concluded that reusable rockets were economically unjustifiable.
They had plenty of data and rigorous studies to support their claims. The X-33 program failed. The Space Shuttle proved that reuse could be more expensive than disposability. The consensus was rational, a logical conclusion of the available evidence.
But the evidence was based on a particular set of assumptions available until then. That's an inevitable, obvious limitation of historical data.
In 2002, Elon Musk founded SpaceX as he questioned a basic assumption of that time:
Rockets are thrown away after every flight.
Musk couldn't help but wonder:
Imagine if airplanes were thrown away after every flight.
The idea had solid merits from a system perspective: the rocket is the most expensive part of launch. If reused, launch costs could drop dramatically, and even partial reuse could change economics.
But Musk's first attempts failed miserably. Time after time, he saw his rockets exploding. No one said that, but I can even imagine scientists and engineers at NASA asking themselves: “Why on earth is he even trying? Why is he ignoring the data?”
But a key element of this story (which is crucial for all of us to understand what actually happened) is that Musk was not challenging the laws of physics or how the universe works. He was challenging engineering assumptions. He was not going against physics or mathematics. Musk was certain that reusable rockets were not impossible physics but instead, hardcore engineering.
I find it fascinating that he applied a software-like mindset to hardware. He changed the engineering method: build, test, fail, fix, repeat. And now, many explosions later, we know the Falcon 9 landed.
While obviously acknowledging the difference in scale, my brainstorming sessions with LLMs showed a similar pattern to Elon Musk’s story. The LLMs I used were not wrong about the past, but they were wrong about what was possible next.
The Setup
For the brainstorming session I report in this article, which took place during November and December 2025, I used several LLMs:
Gemini 3 Pro
GPT-5.1
Claude Opus 4.5
Grok 4.1
Everything I share is accurate for Gemini, GPT, and Claude, but not for Grok. Grok was the only LLM that did not push back or discourage me from pursuing my brainstorming goals. However, it did not provide any useful information for my purposes. It consistently provided annoyingly verbose answers with little to no substance to the extent that it became so irrelevant that I stopped using it. Grok was like that hypster friend that agrees with everything you have to say but has nothing to add to the conversation.
So when I present the dialogs I had with Gemini, GPT, and Claude, I just refer to them as “LLM”.
GPU-Accelerated Exact Integer Arithmetic
The problem I was trying to solve
In cryptography, and more specifically while working with advanced topics such as zero-knowledge proofs, fully homomorphic encryption, and lattice-based systems, we need to compute data with exact integer arithmetic. Not floating-point approximations. That is, every single bit matters. A single rounding error can compromise the accuracy of entire operations.
The problem is something called coefficient explosion. When you perform Gaussian elimination on a matrix of integers, intermediate values grow catastrophically. A 256x256 matrix with 48-bit entries can produce intermediate values exceeding 10,000 bits. These are no longer numbers you can add in a single CPU instruction. They become multi-precision BigIntegers, variable-length data structures that require sequential carry propagation, dynamic memory allocation, and digit-by-digit processing.
GPUs are spectacularly bad at this. Warp divergence, irregular memory access, sequential dependencies, etc. Prior researchers had tried to accelerate BigInteger arithmetic on GPUs and achieved, roughly, 2-5x speedups. Compare that to the 100-1000x speedups GPUs routinely deliver for floating-point matrix operations.
The entire field had essentially concluded that exact integer arithmetic cannot be meaningfully GPU-accelerated since it was the wrong shape for the hardware.
I believed otherwise.
Now, finally, the brainstorming session. Keep in mind that, as I mentioned before, this is a highly compressed version of the actual sessions, and therefore, a heavily edited version of weeks of conversations with several LLMs.
Step 1: The idea
Me: I want to build a system that achieves over 100x GPU speedup for exact integer linear algebra. At the very least, it must be applied to determinants, linear system solving, and matrix inversion over the integers. I want to showcase its application to ZK proof preprocessing, where it is necessary to solve large linear systems exactly over finite fields.
LLM: That’s an ambitious goal. Let me walk you through why this is considered extremely difficult.
GPU acceleration for exact integer arithmetic has been studied extensively, and the results are consistently disappointing. The core issue is that exact arithmetic requires BigInteger operations, and BigIntegers are fundamentally hostile to GPU architectures:
Variable-length data: BigInteger operands have different sizes, causing warp divergence where threads in the same warp must wait for the slowest thread.
Carry propagation: Addition and multiplication require sequential carry propagation across limbs. This cannot be parallelized. It’s inherently serial.
Dynamic memory: As coefficients explode during Gaussian elimination, memory requirements grow unpredictably, making GPU memory management extremely difficult.
Prior work caps out at 2-5x: Multiple research groups (Emmart et al.) have tried GPU BigInteger acceleration and achieved only modest speedups. This is a well-studied problem with well-understood limitations.
For ZK preprocessing specifically, you’d be better served by optimized CPU libraries like FLINT or NTL, which have decades of assembly-level optimization for multi-precision arithmetic. The practical advice is to invest in CPU optimization, not GPU porting.
What specific aspect of this are you hoping to approach differently?
Step 2: Changing the representation
Me: You’re assuming I want to accelerate BigInteger arithmetic. I don’t. I want to eliminate it.
You seem to ignore that the coefficient explosion problem is not a property of mathematics but a property of representation. We represent intermediate values as single large integers, and those integers explode. But the Chinese Remainder Theorem gives us an alternative representation: instead of one 10,000-bit integer, represent it as 323 independent 31-bit residues modulo different primes.
In the CRT representation, there is no coefficient explosion. Every intermediate value is exactly 31 bits, regardless of how large the “true” number would be. And each prime’s computation is completely independent. Embarrassingly parallel. No carries. No variable-length data. No warp divergence. Every thread does the same work on the same-sized data. This is not a “BigInteger on GPU” problem. This is 323 independent 32-bit linear algebra problems. GPUs are designed for exactly this.
LLM: I understand the appeal of CRT decomposition, but I want to be straightforward with you. This approach has been considered before and has fundamental limitations that explain why it hasn’t been adopted:
CRT reconstruction is the bottleneck. Yes, you can do the modular arithmetic in parallel. But at the end, you need to reconstruct the actual integer from 323 residues using Garner’s algorithm. This reconstruction involves BigInteger arithmetic, the very thing you’re trying to avoid. Previous implementations found that reconstruction dominated total runtime, negating the GPU speedup for the modular phase. You’d be moving the bottleneck, not eliminating it.
The number of primes is enormous. For a 256x256 matrix with 48-bit entries, the Hadamard bound gives ~13,000 bits for the determinant. You need ~546 primes. That’s 546 independent matrix factorizations. Even on a GPU, that’s a lot of work. The total computation is O(k * n^3), where k is the number of primes. You’re trading precision complexity for prime complexity.
FLINT already uses multi-modular approaches. The state-of-the-art CPU library FLINT implements CRT-based determinant computation with highly optimized modular arithmetic. And IML uses p-adic lifting, which has better asymptotic complexity: O(n^3 + iter * n^2) vs. your O(k * n^3). You’d need to beat these heavily optimized CPU implementations, not just a naive sequential baseline.
The GPU overhead may eat your gains. Kernel launch latency, memory transfer between host and device, and the overhead of managing hundreds of thread blocks: for matrices smaller than 512x512, this overhead is significant. Prior work on GPU modular arithmetic found break-even points that made the approach impractical for realistic problem sizes.
I’d estimate you might achieve 5-10x speedup at best for large matrices, which would not justify the engineering effort compared to simply using FLINT or IML. The asymptotic advantage of Hensel lifting (IML’s approach) fundamentally limits the CRT approach for large n.
I’d recommend focusing on optimizing the CPU pipeline instead. Have you considered multi-threaded FLINT?
Step 3: The multi-RHS insight
Me: I agree on CRT reconstruction being a bottleneck in prior work. But you are wrong about the conclusion. Prior work did CRT reconstruction on the CPU. I am going to do it on the GPU.
And here is the part you are missing entirely: you are analyzing this as a single linear system solve. But ZK preprocessing does not solve one system. It solves AX = B where B has many columns. 16, 32, sometimes hundreds of right-hand sides. The matrix factorization happens once, then you solve for each column with O(n^2) work.
In the CRT setting, this amortization is multiplicative with the number of primes. For k=16 RHS vectors and 273 primes, I have 16 x 273 = 4,368 independent work units. That’s exactly the scale where GPUs dominate. The factorization cost is amortized across all RHS columns, and the per-column solve is cheap.
Your complexity analysis does not account for amortization. The effective per-solution cost drops dramatically when you reuse the factorization. And GPU CRT reconstruction is embarrassingly parallel, too; each output value can be reconstructed by an independent thread.
LLM: I appreciate the detailed thinking, but I have to be candid about several concerns:
First, implementing GPU CRT reconstruction is a substantial engineering challenge. Garner’s algorithm involves BigInteger accumulation even when individual residues are small. You’d need to implement multi-precision arithmetic inside GPU kernels, with limited shared memory and no dynamic allocation. This is precisely the kind of variable-length, carry-dependent computation that GPUs handle poorly. I’m skeptical you’ll see meaningful speedup here.
Second, the multi-RHS amortization argument is valid in theory, but the practical gains depend heavily on the ratio of factorization to solve time. At the matrix sizes relevant for ZK (typically n=64 to n=256), the factorization is already fast on CPU. The amortization window may not be large enough to justify GPU overhead.
Third, building this system (CUDA kernels for modular Gaussian elimination, GPU CRT reconstruction with multi-precision arithmetic, multi-RHS optimization, bound computation, prime selection) is months of engineering work. Even with the amortization insight, FLINT and IML have 15+ years of optimization. Beating them by a significant margin would require not just algorithmic insight but also GPU kernel engineering at the level of cuBLAS.
I want to be realistic about the effort-to-reward ratio here. The most likely outcome is that you spend significant time building this and achieve performance that’s competitive with but not dramatically better than existing CPU libraries. For a research contribution, you’d need to show something fundamentally new, not just a GPU port of known techniques.
Would it make sense to scope this as a smaller experiment first? Perhaps just the modular solve on the GPU without CRT reconstruction, to validate the basic feasibility?
Step 4: Formalizing the experiment
Me: I will formalize exactly what I am going to build and what I expect to measure. I think you are underestimating the parallelism available and overestimating the difficulty of GPU CRT.
The system, Parallel Lift, will have three phases:
Phase 1, Residue reduction: For each prime p_i, reduce matrix A and the RHS matrix B modulo p_i. This is embarrassingly parallel and trivially fast.
Phase 2, Modular solve: For each prime p_i, perform LU factorization of (A mod p_i) and solve for all k RHS columns. One CUDA thread per prime. All 32-bit arithmetic. No inter-thread communication.
Phase 3, CRT reconstruction: For each output value, launch one CUDA thread running Garner’s algorithm with fixed-size limb arrays. The maximum number of limbs is known at compile time (determined by the Hadamard bound). No dynamic memory. No variable-length operations. Fixed-stride memory access.
My predictions: for n=128, k=16, 48-bit entries, I expect, at the very least, over 500x total speedup over a sequential CPU. For n=256, I expect over 1000x. And GPU CRT reconstruction will achieve 20-30x over CPU CRT, because each output value is independent and the limb count is bounded.
I am also going to implement Hensel lifting as a second approach for larger matrices, where the O(n^3 + iter * n^2) complexity wins over O(k * n^3).
LLM: Those are extraordinarily aggressive predictions. Let me explain why I think they’re unrealistic:
A 500x speedup would place this system far beyond what any prior work has achieved for exact arithmetic on GPUs. The highest reported speedups for GPU-accelerated number-theoretic computation are in the 20-50x range, and those are for highly regular operations like NTT (Number Theoretic Transform), where the parallelism structure is ideal.
Your claim of 1000x+ at n=256 would mean GPU exact integer linear algebra is faster than most GPU floating-point linear algebra speedups over CPU. That would be a remarkable result that contradicts the established understanding of the computational landscape.
Regarding GPU CRT: implementing Garner’s algorithm with fixed-size limb arrays is clever, but carry propagation between limbs is still sequential within each thread. With 546 primes at n=256, you’d need approximately 546 limbs per value. The sequential carry chain across those limbs will limit per-thread throughput significantly.
I’m not saying it’s impossible. I’m saying the expected speedup based on available literature and GPU performance models would be more like 30-80x for the modular solve and 5-15x for CRT reconstruction, giving you maybe 20-60x total end-to-end. Which would still be a solid contribution, but far from your 500-1000x predictions.
If you’re committed to this, I’d suggest starting with a minimal prototype, just the modular determinant on the GPU, no CRT, to calibrate your performance model against reality before investing in the full system.
Step 5: The results
I finally built all of it. And I used Claude to help me with faster prototyping and benchmarks. It was definitely not a walk in the park, and I ran into many more stumbling blocks than happy paths. And yet, I built CUDA and Metal backends with Rust for implementing Multi-RHS optimization, GPU CRT reconstruction, and Hensel lifting during several weeks of committed engineering.
Me: Here are the results, starting with the most expressive one:
The GPU CRT reconstruction achieved 27-32x speedup over CPU CRT.
Comparison to state-of-the-art CPU libraries: Hensel lifting beats IML at every matrix size, 1.2-2.8x faster.
ZK preprocessing (Merkle ledger witness generation) achieved 59x speedup.
The speedup is superlinear. It increases faster than the matrix size because GPU overhead gets amortized over more computation as the problem grows. At n=256, we hit 1,566x.
LLM: These results are striking, and I want to acknowledge that they significantly exceed what I projected. Let me work through why my analysis was wrong:
I underestimated the multi-RHS amortization effect. The superlinear speedup scaling makes sense in retrospect: factorization is O(n^3) and done once, while per-column solving is O(n^2). As n grows, the amortized cost per solution drops, and GPU parallelism across primes provides constant-factor acceleration for both phases. The multiplication of these two effects produces the superlinear curve.
I overestimated the difficulty of GPU CRT reconstruction. Your approach of using fixed-size limb arrays and one thread per value avoids the problems I anticipated. The carry propagation within a thread is sequential, but each thread’s work is small relative to the massive parallelism across output values. The 27-32x CRT speedup is almost exactly what you predicted.
I framed this as “GPU BigInteger acceleration” when it’s actually “representation-driven parallelism.” The entire premise of my skepticism was based on prior work that tried to make BigIntegers faster on GPUs. You sidestepped that problem entirely by choosing a representation where BigIntegers don’t appear during the expensive phase.
Your central claim, that representation, not algorithm or hardware, determines scalability for exact computation, is well-supported by these results. The CRT doesn’t make the computation simpler. It makes the computation parallelizable. That’s a different insight from “use more hardware.”
Your contribution is showing that precision can be converted into parallelism.
What this Experience Tells Us
The LLMs were not hallucinating when it pushed back. Every argument was actually grounded in real papers, real benchmarks, real engineering constraints. GPU BigInteger acceleration is indeed hard, and prior work did cap out at 2-5x. No one is questioning that FLINT and IML are heavily optimized. The CRT reconstruction bottleneck was a real problem in previous implementations.
But all of those facts described the past, not the boundary of the possible.
The LLM’s error was not factual but epistemological. It confused “this has not been done” with “this can not be done,” and “the available evidence does not support this” with “the evidence refutes this.” It looked at my idea, measured it against the distribution of existing results, and concluded, correctly, that it was an outlier. Then it concluded, incorrectly, that outliers are failures.
As I said before, while obviously acknowledging the difference in scale, my experience displays the same pattern as the reusable rockets story. The Shuttle and X-33 limitations were real, the mass penalty calculations were correct, and the analytical models were sound. But they were answering the wrong question. They were asking “Does reuse work under current designs?” when the real question was “Can a different engineering approach make reuse work?”
The LLM, trained on all of that literature, could not see past it. And when I insisted, it did not just disagree. It escalated, warning me about wasted effort. It recommended smaller, safer experiments. It tried to steer me toward incremental improvements on existing approaches.
As I have said in other articles, it was, in the most literal sense, averaging the past and presenting it as the future.
It was Not a One-and-done
As one last note, and to emphasize that the kind of experience I share in this article happens more often than not while working with LLMs, here is an example from last week. I am investigating the potential advantages of neurosymbolic approaches combined with automated theorem provers as a full replacement of LLMs for specific tasks, such as, say, coding.
Here is a screenshot from my phone:
Notice the imperative “Don't try.” This was the response to the very first message I sent in this brainstorming session. It fascinates me (not in a positive way). A human would have asked clarifying questions at this stage. It's too early to come to a conclusion. At that point in time, I had not yet presented all my reasoning for why I believe this could be at least considered. I was just starting a conversation. And yet, the LLM was ready for a verdict in the very first round.
So for me, “Don't try” becomes a mission, a challenge accepted. If it is interesting to me, I will learn no matter the outcome, and that is all the justification I need.
And that's my recommendation to anyone reading this article: don't ever stop at any “Don't try” coming from an LLM. They know the past, but not the present, let alone the future. May you contribute, in any scale, to creating the data that will be used to train future versions of LLMs.






Most of that was beyond my technically fluency but I still very much enjoyed reading it and I like how thorough you are. I appreciate your main point. It is true for everything and why optimism is a necessary attribute of the innovator.
While the tech stuff is well above my level, I really enjoyed your experiment and the conclusions you drew. The way LLMs treat outliers … how they “average the past to predict the future” … You gave me a really different perspective, so thank you for sharing this! Really interesting stuff.