During Infura’s four-plus years, we have worked to provide developers with a simplified access path to Ethereum and IPFS. By setting up the first public IPFS APIs and Gateway alongside our Ethereum API, we built a foundational Web3 development suite for building decentralized applications. By supporting IPFS as part of our infrastructure offering, we’ve enabled the storage of documents, art assets, music, videos, social network information, photos, and more for thousands of users, not to mention distributed, secure storage for the front-end of some of the earliest Dapps. Our IPFS service supports pinning and accessing pinned content directly via the Infura API and allows users to access data pinned across the IPFS network via the Gateway. Currently, we host over 74 million unique objects and handle over 4.5 TB of data transfer per day.
This post is a deep-dive into IPFS, how it works, and how it fits into the Web3 stack.
In the last decade, much of the data composing the internet has moved onto "cloud storage." Most of the applications we use on a daily basis store our personal information in data centers owned by Amazon, Google, or Microsoft. In an effort to pioneer a better internet, Web 3.0, developers are turning to decentralized data networks as a way to improve data resiliency and create new models around data ownership. Emerging technologies like the Interplanetary File System (IPFS) allow us to improve Web 2.0's underlying protocols, making the internet safer and more secure by distributing data across a vast, global network of peers.
IPFS began in 2015 as an effort by Protocol Labs to build a system that could fundamentally change the way information is transmitted across the globe and pave the way for a distributed, more resilient web. IPFS has grown to support an array of different use cases and is improving information management for industries across the spectrum: from disintermediating the music industry to unblocking weather risk protection for agribusiness. Currently, Protocol Labs’ projects include IPFS, the modular protocols and tools that support it, and Filecoin, among others. Between them, these tools serve thousands of organizations and millions of people.
At its core, IPFS is a distributed system for storing and accessing files, websites, applications, and data. It is transport layer agnostic, meaning that it can communicate over various transport layers—including transmission control protocol (TCP), uTP, UDT, QUIC, Tor, and even Bluetooth. IPFS has rules that determine how data and content move around on the network. These rules are similar in nature to Kademlia, the peer-to-peer distributed hash table (DHT) that is widely known for its use in the BitTorrent protocol. This file system layer opens the door for an array of interesting use cases for distributed websites that can run entirely in client-side browsers.
Instead of referring to data (photos, articles, videos) by location, or which server they are stored on, IPFS refers to everything by that data’s hash, meaning the content itself. The idea is that if you want to access a particular page from your browser, IPFS will ask the entire network, “does anyone have the data that corresponds to this hash?” A node on IPFS that contains the corresponding hash will return the data, allowing you to access it from anywhere (and potentially even offline).
IPFS uses content addressing the way HTTP uses URLs. This means that instead of creating identifiers that address artifacts by location, we can address them by some representation of the content itself. This content-addressable approach separates the “what” from the “where,” so data and files can be stored and served from anywhere by anyone. It works by taking a file and hashing it cryptographically so you end up with a very small and reproducible representation of the file, which ensures that no one can create another file that has the same hash and use that as the address. Instead of a server, you are talking to a specific piece of data.
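The core property of content addressing can be sketched in a few lines of Python. This is a toy illustration, not IPFS's actual hashing pipeline (which uses multihashes over protobuf-encoded objects): the address is derived from the bytes themselves, so identical content yields an identical address no matter where or under what name it lives.

```python
# Toy illustration of content addressing (not real IPFS hashing):
# the address depends only on the content, never on its location.
import hashlib

def content_address(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

a = content_address(b"Hello World!\n")
b = content_address(b"Hello World!\n")   # same bytes, different "file"
c = content_address(b"Another World!\n")

assert a == b    # identical content -> identical address
assert a != c    # any change -> a completely different address
```

Because the hash is reproducible from the data alone, anyone holding the content can serve it, and anyone holding the address can verify what they received.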
HTTP vs. IPFS to find and retrieve a file
HTTP has a helpful property in which the location is in the identifier—this makes it easy to find the computers hosting the file and talk to them. This generally works very well, but not in the offline case or in large distributed scenarios where you want to minimize load across the network. It also means that if a particular server is down, the content it hosts is unavailable.
In IPFS you separate the steps into two parts:
- Identify the file with content addressing, via the hash.
- Ask who has it. When you have the hash, then you ask the network you’re connected to “Who has this content (hash)?” and you connect to the corresponding nodes and download it.
The result is a peer-to-peer overlay that enables very fast routing, not tied to a particular physical location but widely and immediately available. To learn more, check out this overview of how IPFS works or watch this video to learn how IPFS deals with files.
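The two-step model above can be sketched with a toy in-memory network. This is an assumption-laden simplification: real IPFS uses a Kademlia-style DHT to route the "who has this hash?" question efficiently, whereas here we simply ask every peer in turn.

```python
# Minimal sketch of the two retrieval steps: (1) identify content by
# hash, (2) ask peers who has it and download from whoever does.
import hashlib

class Node:
    """A peer holding a local content-addressed store (toy model)."""
    def __init__(self):
        self.store = {}

    def put(self, data: bytes) -> str:
        h = hashlib.sha256(data).hexdigest()
        self.store[h] = data
        return h

    def get(self, h: str):
        return self.store.get(h)

def fetch(peers, h):
    # Step 2: ask the network; any peer holding the hash can answer.
    for peer in peers:
        data = peer.get(h)
        if data is not None:
            return data
    return None

alice, bob = Node(), Node()
h = bob.put(b"hello from anywhere")        # Step 1: identify by hash
assert fetch([alice, bob], h) == b"hello from anywhere"
```

Note that it does not matter which peer answers: the hash lets the requester verify the bytes, so trust in any particular server is unnecessary.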
IPFS by Example
IPFS takes the best qualities of well-tested internet technologies such as DHTs and the Git versioning system, while also creating a P2P swarm that allows the exchange of IPFS objects. The totality of IPFS objects forms a cryptographically authenticated data structure known as a Merkle DAG and this data structure can be used to model many other data structures. In this post, we will introduce IPFS objects and the Merkle DAG and give examples of structures that can be modeled using IPFS.
IPFS is essentially a P2P system for retrieving and sharing IPFS objects. An IPFS object is a data structure with two fields:
- Data — a blob of unstructured binary data of size < 256 kB.
- Links — an array of Link structures. These are links to other IPFS objects.
A Link structure has three data fields:
- Name — the name of the Link.
- Hash — the hash of the linked IPFS object.
- Size — the cumulative size of the linked IPFS object, including following its links.
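The two structures above can be written down as Python dataclasses. Field names follow the post; real IPFS objects are protobuf-encoded rather than Python objects, so treat this as a reading aid.

```python
# The IPFS object and Link structures described above, as dataclasses.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Link:
    name: str   # the name of the Link
    hash: str   # the hash of the linked IPFS object
    size: int   # cumulative size of the linked object, following its links

@dataclass
class IPFSObject:
    data: bytes                                   # blob < 256 kB
    links: List[Link] = field(default_factory=list)

hello = IPFSObject(data=b"Hello World!")          # a leaf object: no links
assert hello.links == []
```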
IPFS objects are normally referred to by their Base58 encoded hash. For instance, let’s take a look at the IPFS object with hash QmarHSr9aSNaPSR6G9KFPbuLV9aEqJfTk1y9B8pdwqK4Rq using the IPFS command-line tool (please try this at home!):
> ipfs object get QmarHSr9aSNaPSR6G9KFPbuLV9aEqJfTk1y9B8pdwqK4Rq
{"Links": [], "Data": "Hello World!"}
You may notice that all hashes begin with "Qm." This is because the hash is actually a multihash, meaning that the hash itself specifies the hash function and length of the hash in the first two bytes of the multihash. In the example above, the first two bytes in hex are 1220, where 12 denotes that this is the SHA256 hash function and 20 is the length of the hash in bytes — 32 bytes.
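We can reproduce the "Qm" prefix directly. The sketch below builds a multihash by prepending the two bytes 0x12 (SHA256) and 0x20 (32-byte digest) to a digest, then Base58-encodes the 34-byte result; the small Base58 encoder is written inline since Python's standard library does not include one.

```python
# Why every hash starts with "Qm": Base58(0x12 0x20 + sha256 digest)
# always yields a 46-character string beginning with "Qm".
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(raw: bytes) -> str:
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58[r] + out
    # each leading zero byte encodes as the character '1'
    return "1" * (len(raw) - len(raw.lstrip(b"\x00"))) + out

digest = hashlib.sha256(b"any content at all").digest()
multihash = bytes([0x12, 0x20]) + digest   # fn code + digest length + digest
encoded = b58encode(multihash)

assert multihash.hex().startswith("1220")
assert encoded.startswith("Qm") and len(encoded) == 46
```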
The data and named links give the collection of IPFS objects the structure of a Merkle DAG — DAG meaning Directed Acyclic Graph, and Merkle signifying that this is a cryptographically authenticated data structure that uses cryptographic hashes to address content.
To see the graph structure, we draw each IPFS object as a node containing its Data, with its Links as directed edges to other IPFS objects, where the Name of each Link labels the corresponding edge. The example above is visualized as follows:
We will now give examples of various data structures that can be represented by IPFS objects.
File Systems
IPFS can easily represent a file system consisting of files and directories. Below we'll break down how small and large files are represented, with some supporting examples.
A small file (< 256 kB) is represented by an IPFS object with data being the file contents (plus a small header and footer) and no links, i.e. the links array is empty. Note that the file name is not part of the IPFS object, so two files with different names and the same content will have the same IPFS object representation and hence the same hash. We can add a small file to IPFS using the command ipfs add:
$ ipfs add test_dir/hello.txt
added QmfM2r8SeH2GiRaC4esTjeraXEachRt8ZsSeGaWTPLyMoG test_dir/hello.txt
We can view the file contents of the above IPFS object using ipfs cat:
$ ipfs cat QmfM2r8SeH2GiRaC4esTjeraXEachRt8ZsSeGaWTPLyMoG
Viewing the underlying structure with ipfs object get yields:
$ ipfs object get QmfM2r8SeH2GiRaC4esTjeraXEachRt8ZsSeGaWTPLyMoG
{"Links": [], "Data": "\u0008\u0002\u0012\rHello World!\n\u0018\r"}
We visualize this file as follows:
A large file (> 256 kB) is represented by a list of links to file chunks that are < 256 kB, and only minimal Data specifying that this object represents a large file. The links to the file chunks have empty strings as names.
$ ipfs add test_dir/bigfile.js
added QmR45FmbVVrixReBwJkhEKde2qwHYaQzGxu4ZoDeswuF9w test_dir/bigfile.js
$ ipfs object get QmR45FmbVVrixReBwJkhEKde2qwHYaQzGxu4ZoDeswuF9w
{"Links": [ ... ], "Data": "\u0008\u0002\u0018* \u0010 \u0010 \n"}
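The chunking scheme just described can be sketched as follows. This is a toy in-memory model, not the real IPFS encoding (which uses protobufs and a more sophisticated chunker); the point is the shape: chunks become leaf objects, and a parent object links to them with empty names.

```python
# Toy sketch of large-file chunking: split into < 256 kB pieces, store
# each chunk as its own object, and link them from a parent object.
import hashlib
import json

CHUNK_SIZE = 256 * 1024
store = {}   # hash -> bytes: a stand-in for the local block store

def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def add_file(data: bytes) -> str:
    if len(data) < CHUNK_SIZE:
        return put(data)                       # small file: a single object
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    links = [{"Name": "", "Hash": put(c)} for c in chunks]  # empty names
    return put(json.dumps({"Links": links}).encode())       # parent object

big = b"a" * CHUNK_SIZE + b"b" * CHUNK_SIZE + b"c" * 90112  # ~600 kB
root = add_file(big)
assert len(store) == 4    # 3 distinct chunks + 1 parent object
```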
A directory is represented by a list of links to IPFS objects representing files or other directories. The names of the links are the names of the files and directories. For instance, consider the following directory structure of the directory test_dir:
$ ls -R test_dir
bigfile.js hello.txt my_dir

test_dir/my_dir:
my_file.txt testing.txt
The files hello.txt and my_file.txt both contain the string Hello World!\n. The file testing.txt contains the string Testing 123\n.
When representing this directory structure as an IPFS object it looks like this:
Note the automatic deduplication of the file containing Hello World!\n: the data in this file is stored in only one logical place in IPFS (addressed by its hash).
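The directory encoding and the deduplication above can be sketched with the same toy store. File names live only in the links of directory objects, never in the file objects themselves, which is exactly why identical content collapses to one stored object.

```python
# Toy sketch of directory objects: links are named by file name, and
# two files with identical content share a single stored object.
import hashlib
import json

store = {}   # hash -> bytes: a stand-in for the local block store

def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def add_dir(entries: dict) -> str:
    # entries maps names to child hashes; names live in the links.
    links = [{"Name": n, "Hash": h} for n, h in sorted(entries.items())]
    return put(json.dumps({"Links": links}).encode())

h1 = put(b"Hello World!\n")   # hello.txt
h2 = put(b"Hello World!\n")   # my_dir/my_file.txt: same content
h3 = put(b"Testing 123\n")    # my_dir/testing.txt

assert h1 == h2               # deduplicated: one object in the store
my_dir = add_dir({"my_file.txt": h2, "testing.txt": h3})
root = add_dir({"hello.txt": h1, "my_dir": my_dir})
```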
The IPFS command-line tool can seamlessly follow the directory link names to traverse the file system:
$ ipfs cat
Versioned File Systems
IPFS can represent the data structures used by Git to allow for versioned file systems. The Git commit objects are described in the Git Book. The main properties of the Commit object are that it has one or more links with names parent0, parent1, etc., pointing to previous commits, and one link with name object (called a tree in Git) that points to the file system structure referenced by that commit.
Let’s use the same example as our previous file system directory structure, along with two commits: The first commit is the original structure, and in the second commit we’ve updated the file my_file.txt to say Another World! instead of the original Hello World!.
Also note here that we have automatic deduplication, so that the new objects in the second commit are just the main directory, the new directory my_dir, and the updated file my_file.txt.
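The commit structure and its deduplication can be sketched with a toy content-addressed store. This is a heavy simplification (real Git trees nest directories and commits carry metadata), but it shows the key effect: a second commit only adds the objects that actually changed.

```python
# Toy sketch of Git-style commits over a content-addressed store:
# each commit links to its parent and to the tree it snapshots.
import hashlib
import json

store = {}   # hash -> bytes: a stand-in for the local block store

def put(obj) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

# First commit: snapshot the original file.
tree1 = put({"my_file.txt": put({"Data": "Hello World!"})})
commit1 = put({"object": tree1})

# Second commit: only the changed file, its tree, and the commit are new.
before = len(store)
tree2 = put({"my_file.txt": put({"Data": "Another World!"})})
commit2 = put({"object": tree2, "parent0": commit1})
assert len(store) == before + 3   # unchanged objects are reused
```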
Blockchains
This is one of the most exciting use cases for IPFS. A blockchain has a natural DAG structure, in that each past block is linked by its hash from later ones. More advanced blockchains like the Ethereum blockchain also have an associated state database with a Merkle-Patricia tree structure that can likewise be emulated using IPFS objects.
We assume a simplistic model of a blockchain where each block contains the following data:
- A list of transaction objects
- A link to the previous block
- The hash of a state tree/database
This blockchain can then be modeled in IPFS as follows:
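A minimal sketch of this simplistic block model as content-addressed objects (the account names and amounts below are purely illustrative):

```python
# Toy sketch of a blockchain as a Merkle DAG: each block links to its
# transactions, its parent block, and a state-database hash.
import hashlib
import json

store = {}   # hash -> bytes: a stand-in for the local block store

def put(obj) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

genesis_state = put({"alice": 100})
genesis = put({"transactions": [], "parent": None, "state": genesis_state})

# Next block: only the changed state entries produce new objects.
state2 = put({"alice": 90, "bob": 10})
block2 = put({
    "transactions": [{"from": "alice", "to": "bob", "amount": 10}],
    "parent": genesis,          # link to the previous block by hash
    "state": state2,            # hash of the state tree/database
})
assert json.loads(store[block2])["parent"] == genesis
```

Because every reference is a hash, any node holding the head block's hash can verify the whole chain and its state back to genesis.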
We see the deduplication we gain when putting the state database on IPFS — between two blocks, only the state entries that have been changed need to be explicitly stored, rather than the entire state (which is a much greater data burden).
An interesting point here is the distinction between storing data on the blockchain and storing hashes of data on the blockchain. On the Ethereum platform, you pay a large fee for storing data in the associated state database, in order to minimize bloat of the state database ("blockchain bloat" or "state bloat"). Thus a common design pattern for larger pieces of data is to store not the data itself, but an IPFS hash of the data, in the state database.
Generally, blockchains make a distinction between what is in the global ledger replicated by every miner (i.e., data stored in the chain itself) and data that might be referenced within the chain but isn't replicated between all nodes and should be looked up separately (e.g., because it is too large). If the blockchain with its associated state database is already represented in IPFS, then the distinction between storing a hash on the blockchain and storing the data on the blockchain becomes somewhat blurred, since everything is stored in IPFS anyway, and the hash of the block only needs the hash of the state database. In this case, if someone has stored an IPFS link in the blockchain, we can seamlessly follow this link to access the data as if it were stored in the blockchain itself. In a world where both the blockchain and the linked-to data are IPFS-powered, all links will be IPFS links: both those within the chain to state stored by all nodes, and those linking to data off-chain.
We can still make a distinction between on-chain and off-chain data storage, however, by looking at what miners need to process when creating a new block. In the current Ethereum network, the miners need to process transactions that will update the state database. To do this, they need access to the full state database in order to be able to update it wherever it is changed.
Thus in the blockchain state database represented in IPFS, we would still need to tag data as being “on-chain” or “off-chain.” The “on-chain” data would be necessary for miners to retain locally in order to mine, and this data would be directly affected by transactions. The “off-chain” data would have to be updated by users and would not need to be touched by miners.
We hope you found this overview helpful. In part two, we'll show you how to get started with Infura’s IPFS service, including a tutorial on how to use our IPFS API to pin your data across the IPFS network, as well as upload and access files.
Subscribe to our Infura newsletter so you can be the first to know when our new IPFS product has launched. As always, if you have questions or feature requests, you can join our community or reach out to us directly.
Filecoin, a complementary protocol to IPFS that provides a persistent, distributed data storage system, is launching its mainnet this autumn. You can also sign up for ConsenSys + Filecoin updates for news about collaborations between our product teams and ways to get involved.