The Data Studio

Is Blockchain The Answer?

If you look at any indicator of what the data management world is interested in right now, blockchain will be at the top of the list. Internet searches, pronouncements by CEOs, new books, podcasts, major software companies and industry analysts all seem to be in a blockchain frenzy.

Let's look at what blockchains do. Bitcoin is what made blockchain famous. It is not the only blockchain application, but it is the model for the others, so we will refer to Bitcoin in this description.

Blockchain Features

Simple Chain Structure

The chain structure has been used for at least 50 years in data management software. The simple chain shown here is not "blockchain" (we'll explain how blockchain is different very soon) but it is useful to understand the simple chain structure first.

Figure: A Simple Chain

Each "block" or "record" is linked to the previous one by having a Pointer value that matches the Block ID of the previous block. (In the 50-year-old databases, there would normally be forward pointers as well as backward pointers, and the Block ID would provide some way to access the block efficiently, sometimes being a block address on the disk, sometimes going via an index.)

If we want to change a simple chain, we can. Here's an illustration of a few changes made to the chain shown above:

Figure: Changed Chain

Here we have added a new block in the middle of the chain, changing the Pointer in Block ID "R" to make this work. We have also changed the transaction number of the second transaction in Block ID "Q", and added a new transaction in Block ID "R".
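As a rough sketch of the simple chain and the changes just described, here is a small Python version. The transactions in Block ID "R", the new Block ID "M" and the replacement transaction numbers are made up for illustration; only the Block IDs "A", "Q" and "R" and transactions 001 to 005 come from the text.

```python
# A simple, freely changeable chain: each block points at the previous Block ID.
chain = {
    "A": {"transactions": ["001", "002", "003"], "pointer": None},
    "Q": {"transactions": ["004", "005"], "pointer": "A"},
    "R": {"transactions": ["006"], "pointer": "Q"},  # "006" is invented
}


def walk_back(blocks, start_id):
    """Follow the backward pointers from start_id to the first block."""
    order = []
    current = start_id
    while current is not None:
        order.append(current)
        current = blocks[current]["pointer"]
    return order


before = walk_back(chain, "R")

# Add a new block (hypothetical ID "M") in the middle by re-pointing "R".
chain["M"] = {"transactions": ["900"], "pointer": "Q"}
chain["R"]["pointer"] = "M"

# Change the second transaction in "Q" and add a transaction to "R".
chain["Q"]["transactions"][1] = "555"
chain["R"]["transactions"].append("007")

after = walk_back(chain, "R")
print(before)
print(after)
```

Nothing in this structure resists the changes: every edit is a single assignment, and nothing records that it happened.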

Blockchain wants to make the chain "immutable" - once a block has been added to the chain it cannot be changed. Blockchain achieves this by making it very expensive to change the chain, as we will see soon.

Hashing

Next we have to understand hashing.

A hash is a value that is calculated by applying some function to the series of bytes that make up a field, a record or even a whole file. The hash value is usually a number. With any particular hash function, the same input will always give the same output. Different inputs can give the same hash value, so you can never take the hash value and work out the input from it, because there are many possible answers (an infinite number of answers for most hash algorithms).

Let's start with a very simple example:

We could take the ASCII value of each character in the input, add them all up, divide the result by 256 and take the remainder as our hash. This is what we get for a few different strings:

Input | ASCII values | Sum of ASCII values | Remainder (our hash)
Ron | 82, 111, 110 | 303 | 47
Relational Databases For Agile Developers | 82, 101, 108, 97, 116, 105, 111, 110, 97, 108, 32, 68, 97, 116, 97, 98, 97, 115, 101, 115, 32, 70, 111, 114, 32, 65, 103, 105, 108, 101, 32, 68, 101, 118, 101, 108, 111, 112, 101, 114, 115 | 3893 | 53
Netezza | 78, 101, 116, 101, 122, 122, 97 | 737 | 225
NonStop SQL | 78, 111, 110, 83, 116, 111, 112, 32, 83, 81, 76 | 993 | 225

This hash function is very simplistic and can produce only 256 values, so we do get some collisions (as in the last two rows of this table).
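The toy hash in the table can be written in a couple of lines of Python. The function name toy_hash is my own; the calculation is exactly the one described above (sum of the ASCII values, remainder after dividing by 256).

```python
def toy_hash(text: str) -> int:
    """Add up the ASCII values of the characters and keep the remainder mod 256."""
    return sum(text.encode("ascii")) % 256


for name in ["Ron", "Relational Databases For Agile Developers",
             "Netezza", "NonStop SQL"]:
    print(name, sum(name.encode("ascii")), toy_hash(name))
```

Running it reproduces the sums and remainders in the table, including the collision between "Netezza" and "NonStop SQL".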

The hash function used in blockchains is usually SHA-256. This gives about 10^77 possible values (2^256), so the chance of a collision is very, very small.

The input string can be of any length. In fact, it is common to feed all the bytes of a file into a hash function. You will probably have seen MD5 hashes provided for some downloads. These are used to check the integrity of the downloaded file and make sure that it wasn't corrupted on its way from the server, across the internet to your disk. If just one byte in the file is changed, you will get a different MD5 hash, so you can be very confident that the file you downloaded was valid if the MD5 hash of your copy is the same as the one on the server you got the file from.
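Here is a minimal sketch of that integrity check using Python's standard hashlib module. The byte strings stand in for a real file, and the helper name is_intact is invented for this example.

```python
import hashlib

# Stand-ins for the file on the server and a copy damaged in transit.
original = b"pretend this is the content of a downloaded file"
corrupted = b"pretend this is the content of a downloaded fild"  # last byte differs

# The hash the server would publish alongside the download.
published_md5 = hashlib.md5(original).hexdigest()


def is_intact(data: bytes, expected_md5: str) -> bool:
    """Recompute the MD5 hash of our copy and compare it with the published one."""
    return hashlib.md5(data).hexdigest() == expected_md5


print(is_intact(original, published_md5))
print(is_intact(corrupted, published_md5))
```

For a real file you would read the bytes from disk (in chunks for large files) rather than hold them in a literal, but the comparison is the same.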

Hashes are often used for checking integrity. The last digit of your credit card is a single-digit hash (usually called a check-digit) and is produced using the Luhn algorithm.
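The Luhn calculation is short enough to show in full. This is a sketch with function names of my own choosing; the numbers used below are well-known test values, not real card numbers.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of digits (check digit excluded)."""
    total = 0
    # Work from right to left, doubling every second digit,
    # starting with the rightmost digit of the payload.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit matches the check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


print(luhn_check_digit("7992739871"))     # 3
print(luhn_is_valid("4111111111111111"))  # True
```

Like our toy hash, a single check digit has only a handful of possible values (ten here), so it catches typing slips rather than deliberate tampering.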

Hashes are fundamental to blockchain implementations.

A Blockchain

Now we can turn our simple chain into a blockchain.

This example is simplified but it still shows the main feature that makes blockchains "immutable".

The first step is to make our Block ID "A" the genesis block. We have to make a SHA-256 hash of the contents of the block:

"Block ID|A|Transaction|001|Transaction|002|Transaction|003|Pointer||"

gives us the SHA-256 hash:

dfdba2bdb97e68127d8175ea200be502c192f94cd251e5a5024aed96fb72874e

We now use this as the identifier of the block.

We use the Block hash of the genesis block as the Pointer in the next block (our Block ID "Q").

So now we make a SHA-256 of Block ID "Q" including the Pointer to the genesis block:

"Block ID|Q|Transaction|004|Transaction|005|Pointer|dfdba2bdb97e68127d8175ea200be502c192f94cd251e5a5024aed96fb72874e|"

gives us the SHA-256 hash:

3c4374099d09d10d36545c5bf10db1eb2dbe36b936312b95ce9803c923d82c60

We use the hash of Block ID "Q" as the Pointer in Block ID "R". We continue up the chain like this:

Figure: Blockchain
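The steps above can be sketched in Python with the standard hashlib module. The serialize function copies the pipe-separated layout of the strings shown above; the function names and the contents of Block IDs "R" and "Z" are assumptions of mine. The digests this prints will only match the ones quoted above if the bytes that were hashed are exactly the same, which I have not verified.

```python
import hashlib


def serialize(block_id, transactions, pointer):
    """Lay out a block as 'Block ID|A|Transaction|001|...|Pointer|<hash>|'."""
    parts = ["Block ID", block_id]
    for transaction in transactions:
        parts += ["Transaction", transaction]
    parts += ["Pointer", pointer]
    return "|".join(parts) + "|"


def block_hash(block_id, transactions, pointer):
    """SHA-256 of the serialized block, as a hexadecimal string."""
    text = serialize(block_id, transactions, pointer)
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def build_chain(blocks):
    """Hash each block in turn, using the previous block's hash as its Pointer."""
    chain = []
    pointer = ""  # the genesis block has an empty Pointer
    for block_id, transactions in blocks:
        digest = block_hash(block_id, transactions, pointer)
        chain.append({"id": block_id, "transactions": transactions,
                      "pointer": pointer, "hash": digest})
        pointer = digest
    return chain


chain = build_chain([
    ("A", ["001", "002", "003"]),
    ("Q", ["004", "005"]),
    ("R", ["006"]),  # invented contents
    ("Z", ["007"]),  # invented contents
])
for block in chain:
    print(block["id"], block["hash"])
```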

Just as the MD5 hash on a downloaded file lets you know if the file has been corrupted, so the SHA-256 hash on each block in a blockchain lets you know if the contents of the block have been changed. Suppose we change just one byte in the data of Block ID "Q". Then the hash of Block ID "Q" will have to change to make this block valid.

Now Block ID "Q" has a different Block hash. So to make Block ID "R" valid, we have to change its Pointer so that it points to our new version of Block ID "Q".

Now we have to calculate a new Block hash for Block ID "R".

That means we have to change the pointer in Block ID "Z", and so on until the end of the chain.

So you can change a blockchain, but if you do, then you have to change every block that follows the one that you want to change.
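To make the knock-on effect concrete, here is a small self-contained sketch. It uses a simplified block layout and invented helper names (first_invalid, rehash_from); real implementations differ in detail, but the cascade works the same way.

```python
import hashlib


def block_hash(block):
    """SHA-256 over the block's ID, its transactions and its pointer."""
    text = "|".join([block["id"], *block["transactions"], block["pointer"]])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_chain(blocks):
    chain, previous_hash = [], ""
    for block_id, transactions in blocks:
        block = {"id": block_id, "transactions": list(transactions),
                 "pointer": previous_hash}
        block["hash"] = block_hash(block)
        chain.append(block)
        previous_hash = block["hash"]
    return chain


def first_invalid(chain):
    """Index of the first block whose hash or pointer is wrong, or None."""
    previous_hash = ""
    for index, block in enumerate(chain):
        if block["pointer"] != previous_hash or block_hash(block) != block["hash"]:
            return index
        previous_hash = block["hash"]
    return None


def rehash_from(chain, start):
    """Redo pointers and hashes from index start onward; return blocks redone."""
    previous_hash = chain[start - 1]["hash"] if start > 0 else ""
    for block in chain[start:]:
        block["pointer"] = previous_hash
        block["hash"] = block_hash(block)
        previous_hash = block["hash"]
    return len(chain) - start


chain = build_chain([("A", ["001", "002", "003"]), ("Q", ["004", "005"]),
                     ("R", ["006"]), ("Z", ["007"])])
untouched = first_invalid(chain)            # None: the chain checks out

chain[1]["transactions"][1] = "555"         # change data in Block ID "Q"
after_tamper = first_invalid(chain)         # 1: "Q" no longer matches its hash

chain[1]["hash"] = block_hash(chain[1])     # fix up "Q" only
after_partial_fix = first_invalid(chain)    # 2: "R" still points at the old "Q"

blocks_redone = rehash_from(chain, 1)       # redo "Q", "R" and "Z"
after_full_rewrite = first_invalid(chain)   # None again
print(untouched, after_tamper, after_partial_fix, blocks_redone, after_full_rewrite)
```

With plain SHA-256 the rewrite is cheap, which is why real blockchains add the extra requirements described next.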

This would be very expensive, but since a significant selling point of blockchain is that it is immutable, there are some other requirements that make it even more difficult to change. These include "proof of work" and a "peer-to-peer network" to validate the blockchain.

The "proof of work" is an arbitrary calculation that is done for every block that is added. The calculation is what produces the block hash, and it is more complicated than I have shown above. In fact, many hash calculations are required to produce one block. Bitcoin makes it even more costly to produce the block hash by putting constraints on the hash that is generated. The SHA-256 hash produces a number in the range 0 to 2^256 - 1. Bitcoin sets a limit on the generated number, and peers have to generate a hash below this limit. They do this by changing a special field in the block header (unfortunately called the nonce) each time they generate the hash, until they get a hash value below the limit. The first peer to do this successfully gets to create the block and claim the commission. The limit is adjusted every 2016 blocks to keep the time required to generate a new block roughly constant.
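The nonce search can be illustrated with a toy version. This is a simplified sketch, not Bitcoin's actual block header format or hashing scheme: the block text, the way the nonce is appended and the very generous limit (2^244, so roughly one hash in 4,096 succeeds) are all assumptions chosen so that it finishes instantly.

```python
import hashlib


def hash_value(block_data: str, nonce: int) -> int:
    """SHA-256 of the block data plus nonce, read as a number from 0 to 2**256 - 1."""
    digest = hashlib.sha256(f"{block_data}|{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def mine(block_data: str, limit: int, max_nonce: int = 5_000_000):
    """Try nonces in turn until the hash value falls below the limit."""
    for nonce in range(max_nonce):
        if hash_value(block_data, nonce) < limit:
            return nonce
    return None  # gave up; not expected with such an easy limit


LIMIT = 2 ** 244  # toy limit; a real network sets it far lower
block_data = "Block ID|Q|Transaction|004|Transaction|005|Pointer|(previous hash)"

nonce = mine(block_data, LIMIT)
print("nonce found:", nonce)
print("hash below limit:", hash_value(block_data, nonce) < LIMIT)
```

Finding the nonce takes many hash calculations, but checking someone else's answer takes only one. Lowering the limit makes the search proportionally longer, and that is the dial that gets adjusted.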

The "peer-to-peer network" is a network of anonymous nodes that all compete to create blocks. There are honest peers (we hope) and there may be untrustworthy peers. The system is set up so that untrustworthy peers who try to modify the blockchain will have more work to do than honest peers.

The peers are also responsible for validating the blocks. Part of this is checking that the contents of the block match its hash. Of course, the validation covers only the structure of the transactions and the hashes. The peers have no way of knowing that the data in the blockchain represents a real event in the real world.

Should You Use Blockchain For Your System?

There are many problems, I believe.

Sustainability

Blockchain tries to enforce immutability and the integrity of the blockchain by massive and unnecessary processing. The result is that the organisations that create new blocks ("miners" - a term used to make the process sound exciting and lucrative) are based where there is a large and cheap supply of electricity. The machines used by these organisations are extremely powerful because the fastest processors earn the most commission.

Right now blockchain is tiny by the processing standards of most large organisations in business or government. Bitcoin adds about 65,000 blocks a year to its blockchain. There can be up to about 2,000 transactions in a Bitcoin block. So that's up to 130 million transactions a year. I recently worked in a finance organisation that did 130 million transactions every 4 days, and that's just one company. About 3 transactions a second are processed in Bitcoin. Most relational databases can process thousands of transactions a second. The fastest I know of (VoltDB) processes millions of transactions a second. You can see Bitcoin's own records of scale and performance here.

Blockchain "miners" are in a massive international arms-race with one another. Those who can deploy the most computing power get to add the most blocks to the blockchain and earn the most Bitcoins.

Proponents of Bitcoin say that it will get more efficient and so will the machines that do the work. But they also say that the number and size of blockchain databases will increase dramatically. If the technology is used as widely as the hype suggests then Bitcoin, and blockchain generally, will be significant factors in the climate change disaster because they use so much power.

A typical petrol-driven car has a carbon footprint of about ¼kg of CO2 per mile. A Bitcoin transaction has a carbon footprint of 254kg of CO2. So one Bitcoin transaction has a carbon footprint equivalent to driving 1,000 miles. We don't want to be doing billions of Bitcoin transactions.

This is enough of a reason to put blockchain in the technology wastebin right now, as far as I am concerned.

On 18th June, 2019, Facebook announced its own digital currency that "will let billions of users make transactions". Facebook currently has 2.38 billion monthly active users. For comparison, in the UK there are about 60 million payment card transactions every day, almost one for every person in the UK. That does not include online transactions. What all this tells us is that there is potential for Facebook to take so many transactions that its blockchain is likely to be much bigger than the current Bitcoin blockchain, by orders of magnitude.

Since Facebook is planning a private blockchain, there are opportunities to avoid some of the most expensive processing that Bitcoin and other public blockchains use: the "proof-of-work" requirements. That doesn't let Facebook off the sustainability hook. Already we are seeing a boost in Bitcoin, attributed to Facebook's stated interest in the blockchain technology. Bitcoin is already using more electricity than the Czech Republic and Colombia. We need to stop it, not encourage it.

We don't know the exact details of Facebook's implementation yet. If it is a real blockchain then it will be an unmitigated climate change disaster. If Facebook is just using the name "blockchain" to excite the hype-fueled investors, then it may not be so bad, but it is still encouraging other use of blockchain, so it will be anything but harmless.

The hype for Bitcoin and blockchain is incredible. Even calling Bitcoin a cryptocurrency is hugely misleading. The blockchain is not encrypted! It uses a technique of hashing (SHA-256) that is used a lot in cryptography, but blockchain is using the hash for a different purpose: simply to detect that a change has taken place. As for "smart contracts", you don't need blockchain to make smart contracts; they can be implemented more effectively with relational database technology. So "smart contracts" are not a feature or a benefit of blockchain. To say they are, is just marketing hype.

Limited Functionality

It really is stretching the language of information technology to call blockchain a database, but many sources do. Blockchain is a very simple data structure that can hold a few simple transaction types. The whole thing is a single chain for each application. Bitcoin is one such application, other cryptocurrencies are blockchain applications and there are a few others, mostly accounting ledger applications.

Any analysis of the data in a blockchain requires reading through the whole one-dimensional blockchain, backwards, because that is how the links in the chain work. Each block has up to 2,000 transactions (some implementations have more) so after reading a block, it is then necessary to pull apart the transactions. There are several possible structures for transactions, so you have to know what those are to get at the individual fields within a transaction. It's all low-level tedious coding.
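As an illustration of that backward read, here is a toy lookup that starts at the newest block and follows the pointers until it finds the transaction it wants. The block contents, the short stand-in hash strings and the function name find_transaction are all invented for the example.

```python
# Blocks keyed by (stand-in) block hash; each points at the previous block.
blocks_by_hash = {
    "h-A": {"id": "A", "transactions": ["001", "002", "003"], "pointer": ""},
    "h-Q": {"id": "Q", "transactions": ["004", "005"], "pointer": "h-A"},
    "h-R": {"id": "R", "transactions": ["006"], "pointer": "h-Q"},
    "h-Z": {"id": "Z", "transactions": ["007"], "pointer": "h-R"},
}


def find_transaction(blocks, newest_hash, wanted):
    """Scan backwards from the newest block; return (block ID, blocks read)."""
    blocks_read = 0
    current = newest_hash
    while current:
        block = blocks[current]
        blocks_read += 1
        if wanted in block["transactions"]:
            return block["id"], blocks_read
        current = block["pointer"]
    return None, blocks_read


print(find_transaction(blocks_by_hash, "h-Z", "007"))  # ('Z', 1)
print(find_transaction(blocks_by_hash, "h-Z", "002"))  # ('A', 4)
print(find_transaction(blocks_by_hash, "h-Z", "999"))  # (None, 4)
```

The cost grows with the length of the chain: the older the transaction, or the more certain you need to be that it is not there, the more blocks you read. Anything faster needs an index built and maintained outside the chain.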

Unnecessary Complexity

There are many examples of unnecessary complexity in blockchain implementations. They vary from repeated combining and hashing of values, to storing numbers backwards and forwards.

There is no benefit in hashing a group of SHA-256 hashes to produce another SHA-256 hash. The hash is a one-way function because an infinite number of different strings can produce the same hash. If you could run the hash in reverse you would end up with an infinite number of answers and you would have no way of knowing which one produced the particular hash value in the first place. That's OK, because that is what hashes are designed to do, but hashing a hashed value does not do anything useful, and it's an expensive process.

Some numbers in blockchain are stored "little-endian" and some are stored "big-endian". Big-endian means the most significant byte is stored first, so the bytes read from left to right like a written number; little-endian means the least significant byte is stored first, so you read the bytes from right to left. Unless you just like to be confused, there is no merit in using both conventions in the same system.
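A quick way to see the difference is to store the same number both ways. This sketch uses Python's built-in int.to_bytes and int.from_bytes; the value 2016 and the 4-byte width are arbitrary choices for the example, not a claim about any particular field in Bitcoin.

```python
value = 2016  # 0x07E0 in hexadecimal

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex())     # 000007e0
print(little.hex())  # e0070000

# Reading the bytes back with the right convention recovers the number...
print(int.from_bytes(little, "little"))  # 2016

# ...and reading them with the wrong one silently gives a different number.
wrong = int.from_bytes(little, "big")
print(wrong == value)  # False
```

Nothing in the bytes themselves says which convention was used, so every reader has to know it field by field.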

Appalling Performance By Design

We've spent years trying (and succeeding) to improve performance. Now blockchain is deliberately slowing processing, by wasting huge amounts of computer power, for the purpose of making the blockchain secure. This just leads to a computing-power arms race.

Back in the 1970s I worked on a database that used chained records as a key part of its storage mechanism. In that database a customer account record would be accessed directly, usually in one disk read. Then that account would have all its transactions in a chain spread over many disk blocks, so to access a record in the chain we had to read all the blocks before the block containing the record we wanted. This could get pretty slow, but we had one chain per account; blockchain has one chain for everybody. With fast disks (possibly solid-state disks) Bitcoin can read through its chain more quickly than we could, but their chain is much, much longer. If we get to Facebook-scale and they actually implement a blockchain, then it will take hours to find a particular record. The more people use this blockchain, the longer the chain will get and the more inefficient it will get. We have been working for decades to eliminate bottlenecks like this and now we are creating them deliberately.

Security

Someone with a lot of money and access to a lot of electricity can corrupt the blockchain. The people with the most money are the ones most likely to rip off the rest of us, so this is just an invitation to large-scale organised crime. There have already been some spectacular (and successful) attacks on blockchain-based cryptocurrencies.

There are cheaper and more effective ways of building a secure system.

The Stated Goal Of Blockchain

Blockchain proponents suggest that there is some advantage in having your money tracked by a bunch of anonymous peers, rather than by conventional banks. They say blockchain makes this easier, cheaper, secure, fast and anonymous for the person making a transaction. The "anonymous" bit is true and that makes Bitcoin and other cryptocurrencies attractive to people wanting to do illegal transactions. That is not a good thing. Hiding financial transactions is a cause of much injustice in the world.

The other benefits are not supported by evidence. Why is a bunch of anonymous peers better than my bank? Banks do bad things as we know, but they are accountable. They can be confronted with their crimes, made to pay compensation and fines, and even closed down. We can't do that if we don't know who is responsible. The anonymous peers who get control of cryptocurrencies are those with the most money, because they can afford the most powerful processing systems and make sure that their updates to the blockchain win most of the time. Those with the most money are not often the most honest.

Confidence

Blockchain validates the structure of blocks and transactions. There is no way for blockchain to verify that the transactions it stores represent real events in the real world. "Proving" that the transaction is valid could be an illusion, and a dangerous one, leading to over-confidence in the data.

What Would I Do?

For most applications, I would use a relational database. If I needed a secure ledger I would put it on a secure server that I controlled (not in The Cloud). Then I would use the standard database security to control access. I would have an "insert-only" ledger that would consist of one, or a small number of database tables. The ledger is insert-only because an accounting system should provide a complete and accurate audit trail. If a transaction is wrong then it should not be changed, because then we will not know what happened. Instead it must be corrected by another transaction so that we can see exactly what happened. Mistakes will occur and they need to be put right so that we can see the honest and open story of what happened and how a problem was dealt with. Making these tables "insert-only" is easy with the standard permissions system built into the database.
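Here is a minimal sketch of an insert-only ledger, using the SQLite database that ships with Python so that it runs anywhere. Two things are assumptions of mine: the table layout, and the use of triggers to refuse updates and deletes. SQLite has no user permissions, so the triggers stand in for what a server database would do by granting the application account INSERT and SELECT on the ledger but not UPDATE or DELETE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger (
    entry_id INTEGER PRIMARY KEY,
    account  TEXT    NOT NULL,
    amount   INTEGER NOT NULL,                     -- in pence
    corrects INTEGER REFERENCES ledger(entry_id),  -- entry being corrected, if any
    note     TEXT
);
-- Stand-in for withholding UPDATE and DELETE permission.
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger
BEGIN
    SELECT RAISE(ABORT, 'ledger is insert-only');
END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger
BEGIN
    SELECT RAISE(ABORT, 'ledger is insert-only');
END;
""")


def is_rejected(sql):
    """Return True if the database refuses to run the statement."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


insert = "INSERT INTO ledger (account, amount, corrects, note) VALUES (?, ?, ?, ?)"
conn.execute(insert, ("ACME", 10000, None, "payment received (keyed wrongly)"))
# The mistake is corrected by further entries, never by changing entry 1.
conn.execute(insert, ("ACME", -10000, 1, "reversal of entry 1"))
conn.execute(insert, ("ACME", 1000, 1, "payment received (correct amount)"))

update_rejected = is_rejected("UPDATE ledger SET amount = 1000 WHERE entry_id = 1")
delete_rejected = is_rejected("DELETE FROM ledger WHERE entry_id = 1")
balance = conn.execute(
    "SELECT SUM(amount) FROM ledger WHERE account = 'ACME'").fetchone()[0]
entries = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(update_rejected, delete_rejected, balance, entries)
```

The balance comes out right and all three entries are still there to tell the story of the mistake and its correction.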

There might be times when it is necessary to change the ledger. These could occur because of an upgrade to the application, or to fix the result of a software error that had recorded wrong information in the ledger. It could also be because of data protection laws, such as the right to be forgotten. The change would be run by a system administrator with privilege to override the standard permissions. The change itself would be tested and would provide its own audit trail.

With this relational database it would be possible to keep records of all the data related to the financial transactions as well as the transactions themselves. Users with sufficient privilege would be able to query the database to provide standard accounting reports and ad-hoc analyses of the performance of the company, or particular products or particular groups of customers or many other things. The data would be stored in a data model that reflected the real world, and the metadata describing the data model would be held automatically by the database. The database would then validate the format and content of all data added, automatically.
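A small sketch of what that looks like in practice, again using Python's built-in SQLite. The tables, columns and sample rows are invented for illustration; the point is that the ad-hoc analysis is one declarative query and the validation is declared once in the data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL
);
CREATE TABLE ledger (
    entry_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product     TEXT    NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount <> 0)  -- in pence
);
INSERT INTO customer VALUES (1, 'Alpha Ltd', 'North'), (2, 'Beta plc', 'South');
INSERT INTO ledger (customer_id, product, amount)
VALUES (1, 'widgets', 500), (1, 'gadgets', 250), (2, 'widgets', 300);
""")

# Ad-hoc analysis: sales by region and product.
report = conn.execute("""
    SELECT c.region, l.product, SUM(l.amount)
    FROM ledger AS l
    JOIN customer AS c ON c.customer_id = l.customer_id
    GROUP BY c.region, l.product
    ORDER BY c.region, l.product
""").fetchall()
for row in report:
    print(row)

# Automatic validation: the database rejects an entry that breaks the model.
try:
    conn.execute(
        "INSERT INTO ledger (customer_id, product, amount) VALUES (1, 'widgets', 0)")
    bad_entry_rejected = False
except sqlite3.IntegrityError:
    bad_entry_rejected = True
print(bad_entry_rejected)
```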

The relational database would be able to support applications that did automated payments, to whatever level of complexity the business required. Writing a "smart contract" in blockchain is at least as much work as, and almost certainly more work than, implementing the same contract in a database application.

And finally (for now) you could use your ledger for many other things, such as easy and efficient reports, so that you and others in your organisation could inquire about everything from a particular transaction to the financial trends of your organisation over months and years. That would not be easy with blockchain.