Love and Suffering of Archive Nodes
Matthias, also known as "Ghostie", knows his way around full nodes, including the kind that push hard drives, CPUs, and memory to their limits. In this guest post, he shares his experiences.
By Ghostie
Operating an archive node for blockchains like Ethereum or Polygon is labor-intensive and costly. It is more or less the pinnacle of node setups. The node requires good hardware, maintenance (solid monitoring), and time: at least whenever the monitoring raises an alert or a new release is due. Typically, you only take this on for a specific reason.
An archive node is more than a full node. A full node "only" stores the entire blockchain, whereas an archive node additionally stores every state that each smart contract and transaction ever had (i.e., at every block height). However, the additional data of an archive node can be reconstructed from the data of a full node.
Since this often leads to discussions on Twitter or in online forums, I like to offer an analogy to Bitcoin: an archive node for Ethereum is akin to a Bitcoin node that stores the UTXO (Unspent Transaction Output) set valid at every block height. This is not particularly interesting for Bitcoin because (a) this data can be reconstructed much faster and (b) it holds no special informational value. It's different with Ethereum, because there are smart contracts whose intermediate states are stored, making it possible to directly query, for example, how much USDT an account held at any given time (block height).
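To make this concrete, here is a minimal sketch of such a historical query against an archive node's JSON-RPC interface. It assumes a local node at http://localhost:8545; the holder address and block height are placeholders, and the contract address is the USDT token on Ethereum mainnet. A regular full node would typically refuse to serve state this old.

```python
import requests

RPC_URL = "http://localhost:8545"  # assumed local archive node endpoint
USDT = "0xdAC17F958D2ee523a2206206994597C13D831ec7"      # USDT contract on mainnet
HOLDER = "0x0000000000000000000000000000000000000001"    # hypothetical account
BLOCK = 15_000_000                                        # example historical height

def rpc(method, params):
    """Send one JSON-RPC request to the node and return the result field."""
    r = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                     "method": method, "params": params})
    r.raise_for_status()
    return r.json()["result"]

# balanceOf(address): selector 0x70a08231 plus the 32-byte left-padded address
calldata = "0x70a08231" + HOLDER[2:].lower().rjust(64, "0")

# eth_call with an explicit block parameter; only an archive node keeps
# the state of arbitrarily old blocks around to answer this.
raw = rpc("eth_call", [{"to": USDT, "data": calldata}, hex(BLOCK)])
balance = int(raw, 16) if raw and raw != "0x" else 0
print(f"USDT balance at block {BLOCK}: {balance / 10**6}")  # USDT has 6 decimals
```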
I synchronized my first archive node for the Ethereum ecosystem in 2019. It was out of pure curiosity: I had purchased a lot of server hardware for a project (96 cores, 768 GiB RAM, over 40 TiB SSD storage) and wanted to push it to the limit. So, I synchronized an archive node for Ethereum using geth, which took exactly 7 days and 22 hours. I still remember it so precisely because I discussed the topic on coinforum.de and documented my progress.
In the long run, you need better reasons. That's a significant difference between Ethereum and Bitcoin: Bitcoiners synchronize the entire blockchain out of idealism, while in Ethereum most users rely on node providers like Infura or Alchemy. These providers allow roughly 100,000 free API calls, which is sufficient for smaller projects and wallets.
When and for whom is an archive node useful?
It becomes critical when you want or need to conduct blockchain analyses. If, for instance, you want to know which accounts held what amounts of a token on Ethereum at a given point in time, you need to scan the blockchain block by block to find all holder addresses and then query the historical balance for each ERC-20 transfer. This quickly exceeds the free API calls provided by Alchemy and Infura and can become expensive fast. In such cases, an archive node is worthwhile, because scanning all blocks leads to as many API calls as there are blocks in the blockchain, and Ethereum currently has over 20 million blocks.
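As a rough illustration of why the number of calls explodes, here is a minimal sketch of the first half of such an analysis: collecting every address that ever touched a token by scanning its Transfer events window by window. The endpoint, token, and block range are example values; real nodes and providers also cap how many blocks a single log query may span.

```python
import requests

RPC_URL = "http://localhost:8545"   # assumed local archive node endpoint
TOKEN = "0xdAC17F958D2ee523a2206206994597C13D831ec7"     # example token: USDT
# keccak256 of "Transfer(address,address,uint256)", the standard ERC-20 event topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def rpc(method, params):
    r = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                     "method": method, "params": params})
    r.raise_for_status()
    return r.json()["result"]

holders = set()
start, end, step = 10_000_000, 10_010_000, 2_000   # example range of blocks

# Walk the chain in small windows; every window is one eth_getLogs call.
for frm in range(start, end, step):
    logs = rpc("eth_getLogs", [{
        "address": TOKEN,
        "topics": [TRANSFER_TOPIC],
        "fromBlock": hex(frm),
        "toBlock": hex(min(frm + step - 1, end)),
    }])
    for log in logs:
        holders.add("0x" + log["topics"][1][-40:])   # sender
        holders.add("0x" + log["topics"][2][-40:])   # recipient

# Each collected address now still needs a historical balanceOf call
# (as in the previous sketch); this is where the call count explodes.
print(f"{len(holders)} addresses collected from {(end - start) // step} log queries")
```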
I now operate such nodes for clients. The reasons can sometimes be absurd. For instance, a company offers assistance in complying with EU regulations, which require demonstrating the CO2 consumption for each crypto asset. If you have a token on Ethereum, you must trace back all past transactions and swaps to calculate the gas used by the token. The gas corresponds to the required computational power and thus serves as an indicator of CO2 consumption.
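For the gas accounting described above, one simplistic approach (sticking with the same raw JSON-RPC pattern) is to collect every transaction that emitted an event from the token contract and sum the gas those transactions consumed. Endpoint, contract address, and block window are placeholders; a real analysis would also have to decide how to attribute gas for swaps that touch several contracts.

```python
import requests

RPC_URL = "http://localhost:8545"   # assumed local archive node endpoint
TOKEN = "0xdAC17F958D2ee523a2206206994597C13D831ec7"   # example token contract

def rpc(method, params):
    r = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                     "method": method, "params": params})
    r.raise_for_status()
    return r.json()["result"]

total_gas = 0
seen = set()

# Find transactions that emitted events from the token contract, then add up
# the gas each of them actually consumed (one receipt lookup per transaction).
for frm in range(12_000_000, 12_010_000, 2_000):        # example block window
    logs = rpc("eth_getLogs", [{"address": TOKEN,
                                "fromBlock": hex(frm),
                                "toBlock": hex(frm + 1_999)}])
    for log in logs:
        tx_hash = log["transactionHash"]
        if tx_hash in seen:
            continue
        seen.add(tx_hash)
        receipt = rpc("eth_getTransactionReceipt", [tx_hash])
        total_gas += int(receipt["gasUsed"], 16)

print(f"{len(seen)} token-related transactions, {total_gas} gas in total")
```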
This is, of course, ludicrous. As a Proof-of-Stake blockchain, Ethereum consumes very little energy, and synchronizing an archive node likely generates far more CO2 than the regulation could ever save (if it saves anything at all). But it's not my job to ask such questions; my job is to set up nodes.
What resources are needed to operate an Ethereum archive node?
The differences between nodes and clients can be substantial. I have experience with Ethereum, Polygon, and Solana.
Using the Nethermind client, Ethereum requires about 15 terabytes of storage for the execution-layer data, 8 CPU cores (preferably 16 for synchronization), and 128 gigabytes of RAM. An alternative client is Erigon, which needs only about 3 terabytes of storage. Erigon stores data differently than Nethermind, using a flat layout that only saves deltas between certain blocks. This can make queries for historical data slightly slower, but it saves a lot of space.
For a server, these demands are moderate, and once synchronized, even less is needed. The biggest challenge is the hard drives, or rather the SSDs. To understand why, you need to know how smart contracts work in the Ethereum Virtual Machine (EVM). I'll keep it brief: each smart contract has its own storage area, and depending on the accounts involved, different parts of it have to be loaded. This leads to numerous read and write operations (I/O) with every new block and during synchronization.
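You can see this per-contract storage directly over the RPC interface. The following sketch reads the same storage slot of a contract at several historical heights; each call forces the node to resolve that contract's state at that block, which on an archive node translates into random disk reads. Endpoint, contract, slot, and block numbers are illustrative.

```python
import requests

RPC_URL = "http://localhost:8545"   # assumed local archive node endpoint
CONTRACT = "0xdAC17F958D2ee523a2206206994597C13D831ec7"   # example contract (USDT)

def rpc(method, params):
    r = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                     "method": method, "params": params})
    r.raise_for_status()
    return r.json()["result"]

# Read storage slot 0 of the contract at three historical block heights.
# Every lookup has to walk the contract's state for that block, so each
# one ends up as a handful of random reads against the SSD.
for block in (10_000_000, 12_000_000, 14_000_000):
    value = rpc("eth_getStorageAt", [CONTRACT, hex(0), hex(block)])
    print(f"slot 0 at block {block}: {value}")
```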
Therefore, traditional hard drives (HDDs) are outdated; SSDs are needed. Users of Nethermind must chain several SSDs together to store at least 15 terabytes. This can quickly cost a four-figure amount if done with AWS. With Erigon, you could theoretically synchronize an archive node on a high-performance PC.
Polygon and Solana
And Ethereum is still relatively resource-efficient! Other EVM-compatible blockchains like Polygon or non-EVM-compatible ones like Solana face similar problems: To synchronize, state after state must be calculated and stored, and every smart contract query requires numerous I/O operations.
Imagine what happens if you shorten block times, as Polygon does: even faster drives are needed to handle the requests. While 6,000 IOPS suffice for Ethereum, 20,000 IOPS are recommended for synchronizing Polygon with Erigon. This further drives up costs, likely amounting to 8,000 euros a month if you adhere to the system requirements for an archive node.
Erigon is typically used as the client for Polygon archive nodes. While it needs only 3 terabytes of storage for Ethereum, it requires 10 terabytes for Polygon. And Polygon is significantly younger than Ethereum. So one can already project how resource consumption will look in the future and how much storage a Polygon archive node would hypothetically need with a different client.
Solana is even more extreme. I once operated two nodes for mainnet and testnet to participate in the Solana Grant. These were not archive nodes but validators. Even validators require a CPU with at least 12 cores and at least 128 GiB of RAM for the testnet and 256 GiB for the mainnet (for context: at least one Solana outage was caused by too many validators having "only" 128 GiB of RAM). The network connection must also be good: the system requirements call for at least 1 Gbit/s of symmetric bandwidth (two validators cause a constant load of about 300 Mbit/s), with a monthly traffic volume of around 100 TiB per validator. Such infrastructure is hard to find in many places in Germany.
A regular Solana node already needs 2 terabytes, and an archive node according to online sources would need about 100 terabytes. The costs are enormous.
Can this work out in the long run?
There are doubts about whether this can work long-term. Personally, I believe in technological progress. When I synchronized my Geth node in 2019, it took a good week. Today, on more powerful hardware, it hardly takes any longer, even though the blockchain is much larger. This is due not only to better hardware but also to improved algorithms: block processing in Nethermind v1.26, for example, is 30 to 50 percent faster.
Hardware and software are becoming increasingly powerful, particularly for servers.
In any case, even today a hypothetical investment of 250,000 euros is enough to operate an archive node for Solana for the next ten years. As long as there's a reason and a business model, someone will do it. So I'm not worried that archive nodes will cease to exist.
It's not enough, however, to simply throw more money at hardware in order to scale. Once a blockchain becomes widely used, it runs into scalability limits of its own. Without many accounts, many projects easily reach five- or six-figure transactions-per-second rates in synthetic benchmarks. That's because the accounts and current state fit comfortably in the CPU cache, and the benchmarks are often optimized for parallelism for marketing purposes.
But with more users, things get tight, as Bitcoin shows: even at just 7 transactions per second, 4 gigabytes of RAM becomes critical because the UTXO set is so large. In other blockchains, this state is even larger and more complex. That's why you shouldn't believe the lab promises of blockchain developers. In practice, it gets more challenging.
Moreover, there's a natural barrier: latency. Data needs time to travel from one node to another, with the speed of light as the physical limit. A research paper ("Information Propagation in the Bitcoin Network") concluded that a block interval of at least 12.6 seconds is required for a block to propagate to most nodes worldwide. This is where Ethereum's block interval comes from. More modern blockchains like Solana have much shorter intervals, which leads to nodes concentrating mainly in Europe, the USA, and data centers.
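To get a feel for the physical floor involved, here is a small back-of-the-envelope calculation; the distance and fiber speed are rough illustrative figures, not measurements.

```python
# Rough physical lower bound for sending data halfway around the globe.
EARTH_HALF_CIRCUMFERENCE_KM = 20_000     # farthest two nodes can roughly be apart
LIGHT_IN_FIBER_KM_PER_S = 200_000        # about two-thirds of the vacuum speed of light

one_way_ms = EARTH_HALF_CIRCUMFERENCE_KM / LIGHT_IN_FIBER_KM_PER_S * 1000
print(f"Best-case one-way delay: {one_way_ms:.0f} ms")   # roughly 100 ms

# Real block propagation is far slower: blocks travel over many gossip hops
# and every node verifies them before forwarding, which is why measured
# propagation times end up in the range of seconds rather than milliseconds.
```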
This brings us back to the old question of decentralization and scalability. With only a few nodes close together, scaling is easier. But you don’t have to be a Bitcoin maximalist to wish for full and archive nodes to be spread around the world.
And no one knows…
Recently, I synchronized another Ethereum (Nethermind) and a Polygon (Erigon) archive node. Both clients had bugs that prevented synchronization as an archive node. A bug had crept into Nethermind through the optimization mentioned above; it was quickly fixed after a brief apology ("yes, we don't optimize for running as an archive node"). The bug in Erigon, however, remained open, even though a developer was assigned to it three weeks ago. It likely isn't considered very important.
The bugs were caused by upgrades that altered block processing. They go unnoticed on an archive node that is already running, and also when synchronizing a regular node. You only notice them when you go to the trouble of synchronizing a new archive node, sometimes only after weeks, depending on the server.
The fact that these bugs existed for weeks shows how rarely someone synchronizes a new archive node. Such errors can therefore go unnoticed for a long time, and in the worst case the software could become unable to synchronize a new archive node at all without anyone realizing it, simply because it's done so rarely.
In my case, things worked out well. I reported the issue to Nethermind and helped the developers fix it. The Erigon bug was already known when it hit me, but the fix had been scheduled for a future version. I'm now using an older version; for Polygon, the impact is less significant. However, if such a bug burrows in deeper because archive nodes are synchronized even less frequently and the node developers' CI/CD isn't improved, it might not be as easy to fix.
So, it remains possible—but a challenge that won’t get any easier over time.