High-level Design

Note:
This is a working document that captures the current design. We reserve the right to change the design at any time as prototype development moves forward.

Information scale and performance

DMS3 is designed to enable web scale information management (storage and retrieval) and provides applications with an API that abstracts information as documents stored in index repositories managed by a search engine.

Consistent performance is achieved by enforcing a configurable upper bound on the number of documents in one index repository. Information scale is achieved by supporting an arbitrary number of index repositories backed by a fault tolerant distributed storage layer.

Key terms and concepts

Information is organized into index repository containers 1 considering multiple criteria:

  1. A document kind suggests a certain structure and semantic context for a class of similar documents. Documents of a certain kind are said to conform to a common schema and contain information of a common theme. The configuration of a document kind defines an infospace that specifies the concrete structure of documents in the index repository conveyed to the search engine. The semantics of substructures in the document are subject to the application that generates and processes the document.

  2. An infostore is a set of containers that collects documents of similar kind. As content is added to a repository and as that repository grows and reaches a configurable document limit, the set is expanded with an additional repository to host additional documents of the same kind. The container-set or reposet 2 may expand up to an architectural limit and provides a mechanism to physically shard information. The repository document limit and other life-cycle properties can be configured based on the kind of infospace. An infostore serves to limit the search scope of a query target.

  3. A metastore is a set of index repositories that collects documents providing metadata for corresponding infostores of similar kind. A metastore serves to limit the search scope of a query target. Each document in a metastore repository describes the kind of data contained in an associated infostore. A metastore also represents a manifest for hosted information and serves as a testament made by the endpoint user regarding that information. Matching kind names are used to associate metastores and infostores.
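The growth rule in item 2 can be sketched as follows. This is a minimal illustration, not the DMS3 implementation: the class and method names are hypothetical, and the limit of 255 repositories per set is taken from the initial setID range given later in this document.

```python
from dataclasses import dataclass, field
from typing import List

MAX_SET_ID = 255  # assumed architectural limit, from the initial setID range [1,255]


@dataclass
class Repository:
    set_id: int                      # position of this repository within the reposet
    docs: List[str] = field(default_factory=list)


@dataclass
class Reposet:
    kind: str
    doc_limit: int                   # configurable per-repository document limit
    repos: List[Repository] = field(default_factory=list)

    def add_document(self, doc: str) -> int:
        """Store doc in the newest repository, expanding the set when it is full.

        Returns the set_id of the repository that received the document.
        """
        if not self.repos or len(self.repos[-1].docs) >= self.doc_limit:
            if len(self.repos) >= MAX_SET_ID:
                raise RuntimeError("reposet reached its architectural limit")
            self.repos.append(Repository(set_id=len(self.repos) + 1))
        self.repos[-1].docs.append(doc)
        return self.repos[-1].set_id


blogs = Reposet(kind="blog", doc_limit=2)
placements = [blogs.add_document(f"post-{i}") for i in range(5)]
print(placements)  # [1, 1, 2, 2, 3]
```

Because every repository stays at or below the document limit, query cost per repository stays bounded while the set as a whole keeps growing.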

Infostore and metastore life-cycle management is at the purview of a user managing a DMS3 endpoint node, also known as a source of information.

  • A source may create infostores for personal use or to publish and share information with nodes in the peer-to-peer network.

  • A metastore is defined by associating (or binding) a locally unique key name with a set of locally hosted infostores of similar kind. Key names are not required to be unique in the global DMS3 peer-to-peer network.

  • Ideally, the community of sources shares information using commonly accepted key names for aggregating metadata representing similar content semantics into searchable metadata repositories. For example,

    • A source that wants to publish and share blogs uses a metastore under the kind key "blog" to describe the infostores food-blog and product-brand-review-blog, each of which consists of a reposet.

    • Food and product review blogs can now be added to the respective infostores.

    • The source also creates a metastore under the kind key blog, associated with the food-blog and product-brand-review-blog infostores. This metastore also consists of a reposet.

    • A document is added to the blog metastore to describe the food-blog infostore.

    • A separate document is added to the blog metastore to describe the product-brand-review-blog infostore.

  • A source publishes information by announcing available metastores to the peer-to-peer network.

  • Published information is routed through the peer-to-peer network to enable remote search and access.
  • Published documents are always on the local endpoint node controlled by the source.
  • Routing protocols enable periodic refresh and revocation of published information.
  • Published information may represent a broad number of semantic themes, such as:
    • tweets, messages, blogs, podcasts, media, news, research, and other kinds of articles
    • applications that people can access and run on their client nodes
    • data produced by an application that others can search
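The blog example earlier in this list can be modelled with plain dictionaries. This is only a sketch of the relationship between a metastore and its infostores: the field names (infostore, summary) and the summary texts are hypothetical and do not reflect the actual document schema.

```python
# Infostores hosted by the source, keyed by name; each records its kind.
infostores = {
    "food-blog": {"kind": "blog", "docs": []},
    "product-brand-review-blog": {"kind": "blog", "docs": []},
}

# One metastore per kind key; each document describes one infostore.
metastores = {"blog": []}


def describe(infostore_name: str, summary: str) -> None:
    """Add a metadata document for an infostore to the metastore of matching kind."""
    kind = infostores[infostore_name]["kind"]
    metastores[kind].append({"infostore": infostore_name, "summary": summary})


describe("food-blog", "recipes and restaurant notes")
describe("product-brand-review-blog", "reviews of consumer products")

described = [doc["infostore"] for doc in metastores["blog"]]
print(described)  # ['food-blog', 'product-brand-review-blog']
```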

Any source managing a dms3 node may decide independently to assume a role as curator 3 providing information services on behalf of other peer-to-peer network nodes.

Installing the software

Installation instructions will be provided at a later time.

DMS3 builds on forked versions of key open source software components. We thank all open source contributors and acknowledge their community contributions. A partial list of key components is provided below; additional credits will be provided at a later time.

  • IPFS with modifications to support new capabilities. In particular, dms3 adds the ability to manage document index repositories and disintermediated search.

  • Indri Search Engine as its underlying Search Engine (SE) with some modifications that allow extensible non-text data type support.

Configuring the software

When you initialize dms3 with the dms3 init command, dms3 uses a repository in the local file system.

By default, the repo is located at ~/.dms3. To change the repo location, set the $DMS3_PATH environment variable:

    export DMS3_PATH=/path/to/dms3repo

dms3 extends its use of the local file system repository by managing index repositories in the index sub-folder within it.
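The lookup order described above can be sketched in a few lines. The function names are hypothetical; only the $DMS3_PATH variable, the ~/.dms3 default, and the index sub-folder come from this document.

```python
import os
from pathlib import Path


def dms3_repo_root(env=None) -> Path:
    """Return the repo root: $DMS3_PATH if set, otherwise ~/.dms3."""
    env = os.environ if env is None else env
    override = env.get("DMS3_PATH")
    return Path(override) if override else Path.home() / ".dms3"


def dms3_index_root(env=None) -> Path:
    """Return the index sub-folder inside the repo root."""
    return dms3_repo_root(env) / "index"


# Passing a mapping instead of os.environ keeps the example reproducible.
print(dms3_index_root({"DMS3_PATH": "/path/to/dms3repo"}))
```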

Decentralized applications (DApps)

Base command line interface (CLI)

Summary of index management commands

The index repository life-cycle commands are summarized below.

  • use config to get or set index parameters
  • use mkidx to create an index repository
  • use mkdoc to generate a document template to be indexed
  • use addoc to index a document
  • use rmdoc to delete a document added to an index repository
  • use ls to display a list of index repositories
  • use show to display a document in an index repository
  • use stat to display index repository statistics
  • use start to run in daemon mode
  • use stop to exit daemon mode
  • use restart to restart the daemon
  • use recover to destructively rebuild a corrupted index repository

When not running in daemon mode, each command affecting an index first performs an integrity check on the index and may detect that the index has been corrupted. Some corruptions can be healed; in other cases the recover command must be used to destroy the corrupted index and rebuild it from redundant storage. In daemon mode, index integrity checks are performed when the daemon first opens an index repository.

Base graphic user interface (GUI)

To-Be-Specified

Search Engine

DMS3 uses Vank as its underlying Search Engine (SE). Vank is a forked version of the open source Indri Search Engine augmented with a library that allows extensible application data type support.

Refer to the following page for a background on the Indri search engine.

Refer to Indri repository structure for a discussion of the repository on disk structure.

Refer to Indri parameters file for a discussion of search engine indexer configuration parameters.

Application Programming Interface (API)

DMS3 provides an API to simplify dapp development that targets the platform. Details of the dms3api will be provided at a later time.

The API helps dapps create or connect to multiple index repositories. Each repository has an active in-memory index to which new documents are added dynamically. In-memory indexes are written out to disk for durability, and when the number of indexes grows, a background process merges repository indexes, consolidating them on disk. All this dynamic index management is performed by the search engine for each of the connected repository contexts.

DMS3 treats each index repository as a container, and manages the growth in information using multiple kinds of reposets to physically shard information in bounded containers.

Information Blockchain

The DMS3 information blockchain is a filing system for organizing information to facilitate the creation, storage, retrieval, and sharing of information libraries that scale in size and offer high performance search.

Information is organized into a container namespace managed by lifecycle management libraries and a high performance search engine as searchable index repository sets (a.k.a. reposet).

The container namespace is identified by an ordered set of component name keys defined as:

type
A container has a type classifying the contained information as either data (value "infostore") or metadata (value "metastore").
kind
A container is assigned a unique abstract kind that defines a common structure and semantic theme over the documents it contains. The value for kind is a string chosen by the end user.
name
A container has a name value assigned by the end user that is unique within a kind. Information added to an index repository is secured by the DMS3 blockchain.
setID
A container is assigned a setID that specifies the order of the container within the reposet. The value is an integer bounded by a maximum defined by the architecture; the initial range is [1,255].
wwID
The index library assigns wwID (who-what ID) that identifies the app (who) and the index parameters name (what) that created the index. When creating an index with the dms3api, the application name will always be "dms3" and the index parameters name will match the pre-configured kind value used to generate the parameters file.
volID
A container is assigned a volID that specifies a volume id. The volume id conveys logical group information about the documents within the container. The volume id is assigned by the lifecycle management library.

There are several DMS3 storage areas used to build the DMS3 information blockchain:

Local File Store (lfs)
Index repository state is stored in the local dms3 repository and accessed via the search engine interface. This represents a Mutable Index Repository.
Unix File Store (ufs)
Index metadata and document data files are added to the dms3 UnixFS store, so they may be used to reconstitute an index repository or be used to share information on the p2p network. This is part of the Immutable Index Repository.
Key Value Store (kvs)
Index repository metadata is stored in a dms3 key-value store to enable recovery of index repositories in the local file store, and to share information on the p2p network. This is part of the Immutable Index Repository.

Mutable Index State

The mkidx command is used to create a mutable index repository.

dms3 uses a path naming convention for each index repository on the local file system within the index sub-folder that contains the repository root reposet.

The following shows an example index repository path:

    ls ~/.dms3/index/reposet/blog/infospace-myblog20/1/dms3-blog/w1543348319/

Path component keys map the index into the container namespace:

~/.dms3
initialized default local file system dms3 repository root
/index
local file system dms3 index repository root
/reposet
root folder for all dms3 index repositories on the local file system
/blog
kind name of the reposet, shared by all repos within the set.
/infospace-myblog20
type-name of a repo providing applications with a Physical Sharding Mechanism.
/1
setID of a repo providing container set growth for a kind of information
/dms3-blog
library-assigned wwID label: the app name and the index parameters name used to create the index, separated by a hyphen.
/w1543348319
library assigned volID of a repo providing applications with a Logical Sharding Mechanism.
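The mapping from path components to namespace keys can be sketched as a small parser. The class and function names are hypothetical, and the type-name component is kept exactly as it appears in the path; the setID range check uses the initial range [1,255] stated above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ContainerPath:
    kind: str
    type_name: str   # type-name path component, kept as written
    set_id: int
    ww_id: str       # "<app name>-<index parameters name>"
    vol_id: str


def parse_container_path(path: str) -> ContainerPath:
    """Map the components that follow 'index/reposet' into the container namespace."""
    parts = PurePosixPath(path).parts
    root = parts.index("reposet")            # raises ValueError if the marker is absent
    if root == 0 or parts[root - 1] != "index":
        raise ValueError("expected .../index/reposet/...")
    keys = parts[root + 1:]
    if len(keys) != 5:
        raise ValueError(f"expected 5 components after 'reposet', got {len(keys)}")
    kind, type_name, set_id, ww_id, vol_id = keys
    number = int(set_id)
    if not 1 <= number <= 255:               # initial setID range from this document
        raise ValueError(f"setID {number} is outside [1, 255]")
    return ContainerPath(kind, type_name, number, ww_id, vol_id)


example = parse_container_path(
    "~/.dms3/index/reposet/blog/infospace-myblog20/1/dms3-blog/w1543348319/"
)
print(example.kind, example.set_id, example.ww_id)  # blog 1 dms3-blog
```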

Key Search Engine Properties

This section describes some key low-level interfaces of the search engine. Detailed documentation on the search engine will be provided later and is outside the scope of this document.

Index Metadata and Fields

The indexer defines metadata as non-searchable fields and provides forward and reverse document lookup by metadata value, whereas other fields can be used to qualify query searches.

Document Structure and Schema

Details to be documented at a later time...

Fields are used to convey document structure and are used to influence document queries.

Supported Document Parsers

The preconfigured file classes and their parsers exhibit the metadata extraction behavior discussed here.

The metadata fields for an index are defined by the parameters supplied when the index is created. The only metadata required is docno, used to prevent indexing duplicate documents. The metadata field docno also represents the external name or id of the document.

As of this writing, for a document being indexed (using the addFile interface), the parser always adds the following metadata for html, xml, text, and pdf files. For example:

path
/home/username/temp/data/sitemap.xml
docno
/home/username/temp/data/sitemap.xml
Path to index
The lfs path of the index repository

For html, xml, and text documents, the parser additionally adds the following metadata. For example:

filetype
TEXT

For html and pdf documents, the parser additionally adds the following metadata. For example:

url
/

For pdf documents, the parser additionally adds the following metadata fields:

title
author

The parser additionally adds the metadata listed within a document, regardless of position in the document tag hierarchy (i.e. whether the metadata fields are wrapped within a tag or not). The metadata associated with a document is not required to include all the metadata keys specified during index creation.

A file added to the index may contain multiple documents wrapped within a tag. In the absence of such a tag, the file represents a single document.

The indexer requires one metadata field for documents: docno. The value of this field must be unique within the index; otherwise the document is parsed but not indexed.

When an application-generated document (created using the addString interface) includes a docno field, the index will contain two docno fields: the first added by the parser and the second added by the application. The parser will use the last occurrence as the authoritative docno record.
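The docno rules above can be illustrated with a toy model. This is not the search engine's code: the index is reduced to a dictionary, the function names are hypothetical, and the docno values are made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Metadata = List[Tuple[str, str]]  # ordered (field, value) pairs as they were added


def authoritative_docno(metadata: Metadata) -> Optional[str]:
    """Return the last docno value, mirroring the last-occurrence rule."""
    values = [value for key, value in metadata if key == "docno"]
    return values[-1] if values else None


def add_document(index: Dict[str, Metadata], metadata: Metadata) -> bool:
    """Index the document unless its docno is missing or already present."""
    docno = authoritative_docno(metadata)
    if docno is None or docno in index:
        return False          # parsed but not indexed
    index[docno] = metadata
    return True


index: Dict[str, Metadata] = {}
doc = [("docno", "parser-assigned"), ("filetype", "TEXT"), ("docno", "app-assigned")]
print(add_document(index, doc))   # True
print(add_document(index, doc))   # False, duplicate docno
print(list(index))                # ['app-assigned']
```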

Immutable Index State

Immutable index repository state enables mutable index recovery and sharing on the p2p network.

The immutable index repository state is updated when a container is created, and when a document is added to a container.

Immutable index repository state consists of the following dag node tree:

    erDiagram
        Reposet-root ||--o{ Store-block : contains
        Reposet-root {
            Link-array-256 metastore-block
            Link-array-256 infostore-block
        }
        Store-block ||--o{ RepoProps : contains
        Store-block {
            Link-array-65536 RepoProps
        }
        RepoProps {
            String Type
            String Kind
            String Name
            String Path
            String CreatedAt
            Link Params
            Link-array-256 Corpus
            Boolean PublishStatus
            Integer PublishInterval
            Map Stats
        }
        RepoProps ||--|| Params : links-to
        Params {
            Bytes Tagged-parameters-file
        }
        RepoProps ||--o{ Corpus : links-to
        Corpus {
            Link-array-65536 file-id
        }

Initial design makes the following choices and assumptions:

1) a maximum block size of 2 MiB

2) a maximum corpus size per index repo container of 10 GiB

3) an average document size of 1 KiB

To accommodate the above assumptions, the following constraints can be computed:

  • max block size = 2 x 1024 x 1024 = 2097152 bytes
  • size of CID string = 32 bytes
  • CIDs per block = 2097152 / 32 = 65536
  • max corpus size = 10 x 1024 x 1024 x 1024 = 10737418240 bytes
  • max documents per container = 10737418240 / 1024 = 10485760
  • max CIDs per container = 10485760 (one CID per document file)
  • max corpus blocks per container = 10485760 / 65536 = 160
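The sizing can be rechecked with a few lines of arithmetic. This sketch takes the corpus limit as the 10 x 1024 x 1024 x 1024 bytes used in the computation above and assumes that each document is a separate file referenced by exactly one 32-byte CID; the variable names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

block_size = 2 * MIB          # maximum block size in bytes
cid_size = 32                 # bytes per CID string
corpus_size = 10 * GIB        # the 10 x 1024^3 bytes used in the computation above
avg_doc_size = 1 * KIB        # average document size in bytes

cids_per_block = block_size // cid_size
max_docs = corpus_size // avg_doc_size
# Assumption: every document is a separate file referenced by exactly one CID.
corpus_blocks = -(-max_docs // cids_per_block)   # ceiling division

print(cids_per_block, max_docs, corpus_blocks)   # 65536 10485760 160
```

Under these assumptions the corpus blocks of one container fit within the 256 Corpus links that the diagram shows for RepoProps.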

TODO: describe the RepoProps and Store-block nodes.

Managing Index Repositories

Configure

Index configuration property structure is shown below. Notes on current implementation constraints and limitations are discussed here. Configuration rules are likely to change as the functionality evolves.

Once an index of a specific kind is created, its configuration parameters must not be modified. Otherwise, the search engine may return incorrect query results or, worse, crash. Modifying the index structure renumbers terms such as metadata and fields, effectively corrupting the configuration.

When recovering a corrupted index from immutable state, the documents re-added to the index will need to be re-encoded, to avoid incorrect time information. Recovery tools will be developed at a later time.

The DMS3 configuration file allows configuring various index repository properties. The search engine supports its own set of configuration properties via the parameters file. The lifecycle management library imposes additional conventions when mapping the dms3 index configuration to the search engine parameters file. The following is a summary of the current mapping conventions, which will likely evolve over time:

index and corpus parameter configuration

    params[index] = cfg Indexer.Path (code overrides configured value)
    params[corpus][annotation] = cfg Indexer.Corpus.Annotation (not used)
    params[corpus] = cfg Indexer.Corpus
    params[corpus][path] = cfg Indexer.Corpus.Path (code overrides configured value)
    params[corpus][class] = cfg Indexer.Corpus.Class
    params[corpus][metadata] = cfg Indexer.Corpus.Metadata (not used)

optional parameter configuration

    params[memory] = cfg Indexer.Memory
    params[stemmer][name] = cfg Indexer.Stemmer
    params[normalize] = cfg Indexer.Normalize
    params[stopper][word] = cfg Indexer.Stopper[i]

metadata and field parameter configuration is hard coded

document kind-specific field parameter configuration

    params[field][name][f] = cfg Metadata.Kind[i][f]

note: the infospace interface can override some parameters at time of index creation (see MakeIndex).
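The mapping conventions above can be sketched as a single translation function. This is an illustration under stated assumptions, not the lifecycle management library: build_params and its arguments are hypothetical, the index and corpus paths are supplied by the caller because the code overrides the configured values, and the shape chosen for the field entries is a guess at how params[field][name] is laid out.

```python
from typing import Any, Dict


def build_params(cfg: Dict[str, Any], kind: str,
                 index_path: str, corpus_path: str) -> Dict[str, Any]:
    """Translate a dms3 index configuration into search engine parameters.

    index_path and corpus_path are computed by the caller and override
    whatever Indexer.Path and Indexer.Corpus.Path contain.
    """
    indexer = cfg["Indexer"]
    params: Dict[str, Any] = {
        "index": index_path,
        "corpus": {"path": corpus_path, "class": indexer["Corpus"]["Class"]},
    }
    # Optional parameters are copied only when configured.
    if "Memory" in indexer:
        params["memory"] = indexer["Memory"]
    if "Stemmer" in indexer:
        params["stemmer"] = {"name": indexer["Stemmer"]}
    if "Normalize" in indexer:
        params["normalize"] = indexer["Normalize"]
    if indexer.get("Stopper"):
        params["stopper"] = {"word": list(indexer["Stopper"])}
    # Kind-specific fields come from the matching Metadata.Kind entry.
    for entry in cfg.get("Metadata", {}).get("Kind", []):
        if entry["Name"] == kind:
            params["field"] = [{"name": f} for f in entry["Field"]]
    return params


cfg = {
    "Indexer": {"Corpus": {"Class": "html", "Path": ""}, "Memory": "100M",
                "Normalize": True, "Path": "", "Stemmer": "krovetz",
                "Stopper": ["a", "an", "the", "as"]},
    "Metadata": {"Kind": [{"Name": "blog", "Field": ["Author", "Headline"]}]},
}
params = build_params(cfg, "blog", "/tmp/idx", "/tmp/corpus")
print(params["corpus"], params["stemmer"])
```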

TODO: remove from index configuration:

  • Indexer.Corpus: annotations not used, path is overridden (computed), metadata not used
  • Indexer.Path: path is overridden (computed)
  • Indexer.Stopper: [] is overridden, or complemented by global stopwords file

{
  "Indexer": {
    "Corpus": {
      "Class": "html",
      "Path": ""
    },
    "MaxDocs": "100M",
    "Memory": "100M",
    "Normalize": true,
    "Path": "",
    "Stemmer": "krovetz",
    "Stopper": [
      "a",
      "an",
      "the",
      "as"
    ]
  },
  "Metadata": {
    "Kind": [
      {
        "Field": [
          "About",
          "Address",
          "Affiliation",
          "Author",
          "Brand",
          "Citation",
          "Description",
          "Email",
          "Headline",
          "Keywords",
          "Language",
          "Name",
          "Telephone",
          "Version"
        ],
        "Name": "blog"
      }
    ],
    "Publisher": [
      {
        "Schedule": [
          {
            "Interval": "immediate",
            "Status": "enabled",
            "Name": [
              "myblog20"
            ]
          },
          {
            "Interval": "daily",
            "Status": "enabled",
            "Name": [
              "mideastern-foods"
            ]
          },
          {
            "Interval": "weekly",
            "Status": "disabled",
            "Name": [
            ]
          }
        ]
      }
    ]
  },
  "Retriever": {
    "MaxResultCount": 100
  }
}

Details to be documented at a later time...

Index

Details to be documented at a later time...

Query

Details to be documented at a later time...

Track

A key-value pair is stored in the dms3 KVS data store when a new index repository is created.

A key composed from the container namespace name of an index repository is used to look up repo statistics in the KVS.

Additional details to be documented at a later time...

Recover

The immutable index state is used to reconstruct the mutable index state.

Index container recovery will be on a container instance basis.

Details to be documented at a later time...

Publish

An index may be marked for publishing to share its content on the p2p network.

The publish properties of an index are specified in the index configuration file.

A number of mutually exclusive publishing schedules are supported. An index repository may be assigned to at most one schedule.

Publishing properties are bound to the repository name key of the container namespace, and affect all container instances within the named sub-space.

The publish properties define:

Status
Current publishing status. The value is enabled or disabled. The default value is initialized to disabled when the index repository is created.
Interval
The interval duration at which index state updates are published. Valid interval values include: immediate, daily, weekly, biweekly, monthly, quarterly, half-annual, annual
Name
The list of index repository names to be published.
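The schedule rules above (a valid interval, a valid status, and at most one schedule per repository name) can be checked with a short validation routine. This is a sketch with hypothetical function and variable names; the key names and allowed values are taken from this section.

```python
from typing import Any, Dict, List

VALID_INTERVALS = {"immediate", "daily", "weekly", "biweekly", "monthly",
                   "quarterly", "half-annual", "annual"}
VALID_STATUS = {"enabled", "disabled"}


def check_schedules(schedules: List[Dict[str, Any]]) -> Dict[str, str]:
    """Return a name -> interval map, rejecting invalid or overlapping schedules."""
    assigned: Dict[str, str] = {}
    for schedule in schedules:
        interval = schedule["Interval"]
        status = schedule.get("Status", "disabled")   # documented default
        if interval not in VALID_INTERVALS:
            raise ValueError(f"unknown interval: {interval}")
        if status not in VALID_STATUS:
            raise ValueError(f"unknown status: {status}")
        for name in schedule.get("Name", []):
            if name in assigned:
                raise ValueError(f"{name} is assigned to more than one schedule")
            assigned[name] = interval
    return assigned


schedules = [
    {"Interval": "immediate", "Status": "enabled", "Name": ["myblog20"]},
    {"Interval": "daily", "Status": "enabled", "Name": ["mideastern-foods"]},
    {"Interval": "weekly", "Status": "disabled", "Name": []},
]
assignments = check_schedules(schedules)
print(assignments)
# {'myblog20': 'immediate', 'mideastern-foods': 'daily'}
```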

Once the index publishing schedule is configured, you must run the daemon with the index publishing and subscription feature enabled:

    dms3 daemon [--enable-index-pubsub]

Network Protocol Stacks

DMS3 offers four classes of fault tolerant services.

  1. Decentralized Information Blockchain protocol services to provide distribution and access services for shared public data.

  2. Decentralized Financial Blockchain protocol services to provide smart contract based information trading services.

  3. Centralized practical byzantine fault tolerant (PBFT) services to provide data storage scaling, protection, access, and distribution services.

  4. Decentralized p2p protocol services to provide distribution and access services for shared public data.

The following sub-sections describe these services.

Decentralized Information Blockchain Services

Details to be documented at a later time...

Decentralized Financial Blockchain Services

Details to be documented at a later time...

Practical Byzantine Fault Tolerant (PBFT) Services

DMS3 allows participants to offer centralized data protection services to protect the users' personal information.

These services protect persisted information by storing dapp-consistent redundant copies on multiple endpoints dedicated to one user. The set of endpoints so configured by a user forms a user private network that enables access to data from any node within the defined set.

The set of nodes in the private network may be owned by the user or leased from a centralized service provider. Network services encrypt data with the user's private key prior to storing it on leased equipment.

Nodes and storage scaling capacity of private networks may be restricted to enable reasonable performance expectations. Private networks may be configured to asymmetrically distribute redundant information to compensate for variations of hardware configurations within the defined set of endpoints.

dms3 implements two PBFT services.

  1. Information storage and retrieval, providing index and query services.
  2. Network file system (NFS) service, providing backing for node storage.

DMS3 PBFT services automatically recover from a configured number of arbitrary simultaneous faults.

Additional details to be documented at a later time...

Index fault recovery

The search engine can heal and recover an index from certain faults on its own. For other faults, external recovery support using redundant data is required.

There are two options to recover an index repository.

  1. Redundant repository state stored in dms3. When an index is created, its configuration state is saved external to the search engine. When a document is added to an index, the document is also saved external to the search engine. The recover command can destructively recreate the index from this externally saved information. This recovery option does not rely on PBFT.

  2. Redundant copies of the local file system index repository root. Optional PBFT services automatically keep redundant copies of information in sync. Recovery for simultaneous fault counts exceeding the configured protection level must be handled manually as an exception case.

Peer to peer (p2p) network services

Publishing a repository

Use the index config command to enable publishing of repositories to be shared with other users on the p2p network.

Published information distribution and protection services

High demand for published information can overburden a participant's compute resources. DMS3 enables optional paid p2p services to offload bandwidth, compute, and storage loads onto a proprietary fault tolerant centralized Data Cloud.

The p2p protocol network also enables participants to contribute compute and storage resources and earn income in return for the use of those resources.

Information curators

Any source managing a dms3 node may decide independently to assume a role as curator providing information services to the peer-to-peer network. These services are independent from locally published information and relate to information published by remote peer sources.

  • A curator may choose to operate in limited roles, offering one or more hosting services to the peer-to-peer network:
    • A directory role indicates a hosting service that enables metastore metadata queries.
    • A replica role indicates a hosting service that enables infostore data queries.
    • Each curator peer node decides what information and role, if any, is offered.
    • Curators offer source authors information distribution and access services.
  • Hosting services are periodically announced to the peer-to-peer network via the routing protocol.

The dms3 peer-to-peer network will offer cryptocurrency as incentive gas to drive the information flow economy.

  • A Curator may earn coins/tokens for hosting services
  • An Author may pay transaction fees for distribution on the network
  • An Author may earn tokens for published content in the information flow economy
  • Paid content is structured as a pair of documents
    • Free promotional (excerpt, abstract) short form content with terms references
    • Paid long form content with term enforced by a smart contract
  • Smart contracts enforce terms for a variety of payment plans


  1. container and repository are used interchangeably in this document to mean the set of files and folders the search engine uses to track volatile and persistent state of a repository.

  2. container-set and reposet are used interchangeably in this document; both are generic terms for the set of index repositories that form an infostore.

  3. curators provide compute and storage hosting resources for published information in return for income.