DataLad project

This is the DataLad project’s (meta) documentation. It should have everything there is to know about DataLad – the project, not (just) the software. This includes how things are done, what we are planning on doing, and maybe why we are no longer doing things in particular ways.

Select any topic from the menu or search this site for information.

About the DataLad Project

DataLad is a Python-based distributed data management system that keeps track of your data with version control, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure. DataLad (the software) is developed and maintained as a free and open source project by a global and interdisciplinary community of scientists.

The primary goal of the DataLad project is to support the collaborative process of distilling knowledge from data according to the FAIR Guiding Principles (Findability, Accessibility, Interoperability, and Reusability). We emphasize creating an inclusive, supportive space where users are empowered to make the most of our products and contribute to the project and community, and we strive to foster an interconnected network through interoperable software development within a larger ecosystem, the organization of community events, and participation in collaborative research initiatives.

Historically, the DataLad project was established for researchers in medicine and the neurosciences, and it is hosted through a collaboration between the Brain and Behaviour division of the Institute of Neuroscience and Medicine (INM-7) at Forschungszentrum Jülich and the Center for Open Neuroscience affiliated with Dartmouth College. The project’s domain-agnostic focus on software interoperability through integrations and extensions now extends its reach into diverse disciplines to anyone seeking to work responsibly with data. DataLad is governed as a consensus-based meritocracy relying on its thousands of users and its dedicated contributors.

Users and developers can ask questions and support each other asynchronously in the community Matrix chat or via Q&A (Question and Answer) portals, or interact live during an online office hour call. Contributors engage with the project through many avenues, including the various communication channels: by submitting issues, documentation, and code for consideration in project code repositories, by participating in discussions on development, project management, and strategic planning, and by voting on the development mailing list. Community members are encouraged to show their support for the project by following the DataLad blog and social media accounts. All members of the DataLad community adhere to its Code of Conduct.

DataLad development is primarily funded by the U.S. National Science Foundation (NSF) and the German Federal Ministry of Education and Research (BMBF). Additional support has been provided by the Helmholtz Research Center Jülich (FZJ), the U.S. National Institute of Biomedical Imaging and Bioengineering (NIBIB) via ReproNim, the European Union’s Horizon 2020 research and innovation programme, the Deutsche Forschungsgemeinschaft (DFG), and the German federal state of Saxony-Anhalt and the European Regional Development Fund. All DataLad code and learning resources are available under open-source licenses, and the software itself and its associated documentation are published under the MIT license.

Subsections of Products

DataLad extensions

DataLad is not just a single software package. Numerous extension packages can equip the base package with additional functionality, or even tailor and tune the way the base package works.

DataLad extensions are shipped as separate Python packages. The installation is typically done with standard Python package managers, such as pip. For some extensions it may be necessary to perform additional set up steps in order to become fully functional.
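
Because extensions are ordinary Python packages, one way to see which ones are present in an environment is to inspect installed package metadata. The sketch below assumes that extensions register themselves under an entry point group named `datalad.extensions`; that group name and the helper `list_datalad_extensions` are assumptions of this example and are not stated on this page.

```python
from importlib.metadata import entry_points


def list_datalad_extensions(group="datalad.extensions"):
    """Return the sorted names of entry points registered under `group`.

    The default group name is an assumption of this sketch.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select()
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    # Prints nothing if no matching packages are installed
    for name in list_datalad_extensions():
        print(name)
```

The function only reads metadata; it does not import or configure any extension, so additional setup steps required by a particular extension are not covered here.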

Here is a list of available extension packages for the DataLad software:

Subsections of DataLad extensions

datalad-container extension

This extension equips DataLad’s run/rerun functionality with the ability to transparently execute commands in containerized computational environments.

datalad-next extension

This DataLad extension can be thought of as a staging area for additional functionality, or for improved performance and user experience. Unlike other topical or more experimental extensions, the focus here is on functionality with broad applicability. This extension is a suitable dependency for other software packages that intend to build on this improved set of functionality.

datalad-xnat extension

This extension package equips DataLad with a set of commands to track XNAT projects.

XNAT is an open source imaging informatics platform developed by the Neuroinformatics Research Group at Washington University. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. XNAT can be used to support a wide range of neuro/medical imaging-based projects.

Installer

The DataLad installer is a utility for installing DataLad, git-annex, and related components all in a single invocation. It requires no third-party Python libraries, though it does make heavy use of external packaging commands.

DataSalad

This is a pure-Python library with a collection of utilities for working with data in the vicinity of Git and git-annex. While this is a foundational library from and for the DataLad project, its implementations are standalone, and are meant to be equally well usable outside the DataLad system.

A focus of this library is efficient communication with subprocesses, such as Git or git-annex commands, which read and produce data in some format.
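
To illustrate the kind of problem meant here, the following sketch streams the output of a child process line by line instead of collecting it all in memory first. It uses only the Python standard library and is not DataSalad's API; the helper name `iter_output_lines` is hypothetical.

```python
import subprocess
import sys


def iter_output_lines(cmd):
    """Yield stdout lines of `cmd` one at a time, as they arrive."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the `with` block waits for the process, so returncode is set
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    # A portable stand-in for a Git or git-annex command
    demo = [sys.executable, "-c", "print('a'); print('b')"]
    for line in iter_output_lines(demo):
        print(line)
```

A generator keeps memory use flat for long-running commands, and raising on a non-zero exit code after the last line means a consumer cannot silently miss a failed command.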

Subsections of Support

Office hour

We run a weekly online office hour, where anybody can pop in to ask questions, or to demo challenges. The location/link is communicated via the office hour chatroom. This is also the place where cancellations or holiday breaks are announced.

The current office hour slot is every Tuesday 16:00 CE(S)T.

Q&A portals

The following Q&A portals are monitored to some extent:

  • neurostars.org using the datalad tag. This is the most active Q&A site for DataLad-related questions, due to the historically large user community in this field.
  • stackoverflow.com using the datalad tag

Documentation

Subsections of Documentation

Handbook

Knowledge base

The site https://knowledge-base.psychoinformatics.de provides a collection of solutions or workarounds for particular challenges. These items are typically curated outcomes of prior support requests.

For individual packages

Mention readthedocs and docs.datalad.org

Subsections of Websites/Services

Governance

This is a meritocratic, consensus-based community project. Anyone with an interest in the project can join the community, contribute to the project design and participate in the decision making process. This document describes how that participation takes place and how to set about earning merit within the project community.

Scope

The DataLad project develops software that is part of a larger ecosystem of interoperable components, also contributed by other entities. This document is exclusively concerned with the governance of a subset of those components that are collectively developed and maintained by the DataLad project. The components presently are:

Changes to this list are made by the project management committee.

Project setup and processes

The sections listed below provide detailed descriptions of individual aspects of this project’s governance.

Acknowledgments

This document is based on a template written by Ross Gardler and Gabriel Hanganu accessible at http://oss-watch.ac.uk/resources/meritocraticgovernancemodel.

Subsections of Governance

Communication

The main communication channel for contributors is the development mailing list, presently at datalad-devel@fz-juelich.de. Mailing list messages are archived. The list archive is public, presently at https://lists.fz-juelich.de/pipermail/datalad-devel. The project maintains a number of additional communication channels for a variety of purposes and audiences. However, all communication on management, strategic planning, and voting takes place on the development mailing list. All contributors are encouraged to subscribe to this mailing list.

Roles and responsibilities

Users

Users are community members who have a need for the project. They are the most important members of the community and without them the project would have no purpose. Anyone can be a user; there are no special requirements.

The project asks its users to participate in the project and community as much as possible. User contributions help to ensure that the project outputs satisfy the needs of those users. Common user contributions include (but are not limited to):

  • evangelizing about the project (e.g. a link on a website and word-of-mouth awareness raising)
  • informing developers of strengths and weaknesses from a new user perspective
  • providing moral support (a “thank you” goes a long way)
  • showing support, e.g., by “starring” the project on GitHub or subscribing/liking/following the project on social media.

Users who continue to engage with the project and its community will often become more and more involved. Such users may find themselves becoming contributors.

Contributors

Contributors are community members who contribute in concrete ways to the project. Anyone can become a contributor, and contributions can take many forms. There is no expectation of commitment to the project, no specific skill requirements and no selection process.

In addition to their actions as users, contributors may also find themselves doing one or more of the following:

  • supporting new users (existing users are often the best people to support new users)
  • reporting bugs
  • identifying requirements
  • providing graphics and web design
  • programming
  • assisting with project infrastructure
  • writing documentation
  • fixing bugs
  • adding features

Contributors engage with the project through issue trackers or other communication channels, or by writing or editing documentation. They submit changes to the project code repositories, which will be considered for inclusion by existing committers. The development mailing list is the most appropriate place to ask for help when making that first contribution.

Contributors are expected to behave in accordance with the project’s code of conduct.

As contributors gain experience and familiarity with the project, their profile within, and commitment to, the community will increase. At some stage, they may find themselves being nominated for committership.

Committers

Committers are community members who have shown that they are committed to the continued development of the project through ongoing engagement with the community. Committership allows contributors to more easily carry on with their project-related activities by giving them direct access to the project’s resources. That is, they can make changes directly to project outputs, without having to submit changes via patches.

This does not mean that a committer is free to do what they want. In fact, committers have no more authority over the project than contributors. While committership indicates a valued member of the community who has demonstrated a healthy respect for the project’s aims and objectives, their work continues to be reviewed by the community before acceptance in an official release. The key difference between a committer and a contributor is when this approval is sought from the community. A committer seeks approval after the contribution is made, rather than before.

Seeking approval after making a contribution is known as a commit-then-review process. It is more efficient to allow trusted people to make direct contributions, as the majority of those contributions will be accepted by the project. The project employs various communication mechanisms to ensure that all contributions are reviewed by the community as a whole. By the time a contributor is invited to become a committer, they will have become familiar with the project’s various tools as a user and then as a contributor.

Anyone can become a committer; there are no special requirements, other than to have shown a willingness and ability to participate in the project as a team player. Typically, a potential committer will need to show that they have an understanding of the project, its objectives and its strategy. They will also have provided valuable contributions to the project over a period of time.

New committers can be nominated by any existing committer. Once they have been nominated, there will be a vote by the project management committee (PMC). Committer voting is one of the few activities that takes place on the project’s private management channel. This is to allow PMC members to freely express their opinions about a nominee without causing embarrassment. Once the vote has been held, the outcome of the vote is communicated to the project via the development mailing list. The nominee is entitled to request an explanation of any ’no’ votes against them, regardless of the outcome of the vote. This explanation will be provided by the PMC Chair and will be anonymous and constructive in nature.

Nominees may decline their appointment as a committer. However, this is unusual, as the project does not expect any specific time or resource commitment from its community members. The intention behind the role of committer is to allow people to contribute to the project more easily, not to tie them in to the project in any formal way.

It is important to recognize that committership is a privilege, not a right. That privilege must be earned and once earned it can be removed by the PMC in extreme circumstances. However, under normal circumstances committership exists for as long as the committer wishes to continue engaging with the project. Effective commit access to project repositories for an individual person may be disabled for security reasons when they have not contributed within the last 12 months, but is reinstated upon request by the committer.

A committer who shows an above-average level of contribution to the project, particularly with respect to its strategic direction and long-term health, may be nominated to become a member of the PMC.

Project management committee (PMC)

The project management committee has additional responsibilities over and above those of a committer. These responsibilities ensure the smooth running of the project. PMC members are expected to review code contributions, participate in strategic planning, approve changes to the governance model and manage the copyrights within the project outputs.

Members of the PMC do not have significant authority over other members of the community, although it is the PMC that votes on new committers. It also makes decisions when community consensus cannot be reached. The PMC also decides whether to include additional components of the DataLad ecosystem under this governance umbrella, or whether to remove components that have been covered previously. In addition, the PMC has access to the project’s private communication channels. These are used for sensitive issues, such as votes for new committers and legal matters that cannot be discussed in public. They are never used for project management or planning.

Membership of the PMC is by invitation from the existing PMC members. A nomination will result in discussion and then a vote by the existing PMC members. PMC membership votes are subject to consensus approval of the current PMC members.

PMC Chair

The PMC Chair is a single individual, voted for by the PMC members. Once someone has been appointed Chair, they remain in that role until they choose to retire, or the PMC casts a two-thirds majority vote to remove them.

The PMC Chair has no additional authority over other members of the PMC: the role is one of coordinator and facilitator. The Chair is also expected to ensure that all governance processes are adhered to. If there is no external PMC member that can function as a tie breaker, the Chair has the casting vote when the project fails to reach consensus.

External PMC member

The PMC may have one member who is not a committer. This person shall represent the larger ecosystem and community that DataLad is part of. Membership is by invitation from PMC members, and voted for by PMC members. Membership continues until the person chooses to retire, or the PMC casts a two-thirds majority vote to remove them. The external PMC member does not have a binding vote. However, in case of a tie between choices in the outcome of a vote, the external PMC member is asked to select one of these options as the winner.

Support

All participants in the community are encouraged to provide support for new users. This support is provided as a way of growing the community. Those seeking support should recognize that all support activity within the project is voluntary and is therefore provided as and when time allows. A user requiring guaranteed response times or results should therefore seek to purchase a support contract from a community member. However, for those willing to engage with the project on its own terms, and willing to help support other users, the community support channels are ideal.

Contribution process

Anyone can contribute to the project, regardless of their skills, as there are many ways to contribute. For instance, a contributor might be active on the project mailing list and issue tracker, or might supply patches. The development mailing list is the most appropriate place for a contributor to ask for help when making their first contribution.

Decision making process

Decisions about the future of the project are made through discussion with all members of the community, from the newest user to the most experienced PMC member. All non-sensitive project management discussion takes place on the development mailing list. Occasionally, sensitive discussion occurs on a private channel.

In order to ensure that the project is not bogged down by endless discussion and continual voting, the project operates a policy of lazy consensus. This allows the majority of decisions to be made without resorting to a formal vote.

Lazy consensus

Decision making typically involves the following steps:

  • Proposal
  • Discussion
  • Vote (if consensus is not reached through discussion)
  • Decision

Any community member can make a proposal for consideration by the community. In order to initiate a discussion about a new idea, they should send an email to the development mailing list or submit a patch implementing the idea to the issue tracker (or version-control system if they have commit access). This will prompt a review and, if necessary, a discussion of the idea. The goal of this review and discussion is to gain approval for the contribution. Since most people in the project community have a shared vision, there is often little need for discussion in order to reach consensus.

In general, as long as nobody explicitly opposes a proposal or patch, it is recognized as having the support of the community. This is called lazy consensus - that is, those who have not stated their opinion explicitly have implicitly agreed to the implementation of the proposal.

Lazy consensus is a very important concept within the project. It is this process that allows a large group of people to efficiently reach consensus, as someone with no objections to a proposal need not spend time stating their position, and others need not spend time reading such mails.

For lazy consensus to be effective, it is necessary to allow at least 96 hours before assuming that there are no objections to the proposal. This requirement ensures that everyone is given enough time to read, digest and respond to the proposal. This time period is chosen so as to be as inclusive as possible of all participants, regardless of their location and time commitments.
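
The waiting rule can be written down as a small check. This is a minimal sketch of the 96-hour rule as described above; the function name and the simple objection counter are hypothetical and not part of any project tooling.

```python
from datetime import datetime, timedelta, timezone

# Minimum waiting period stated in the governance text
LAZY_CONSENSUS_WAIT = timedelta(hours=96)


def lazy_consensus_reached(proposed_at, now, objections=0):
    """True if the waiting period has elapsed and nobody objected."""
    return objections == 0 and now - proposed_at >= LAZY_CONSENSUS_WAIT


if __name__ == "__main__":
    proposed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Earliest moment at which silence may be read as agreement
    print(proposed + LAZY_CONSENSUS_WAIT)
```

Using timezone-aware timestamps avoids ambiguity for a community spread across time zones.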

Voting

Not all decisions can be made using lazy consensus. Issues such as those affecting the strategic direction or legal standing of the project must gain explicit approval in the form of a vote. Every member of the community is encouraged to express their opinions in all discussion and all votes. However, only project committers and/or PMC members have binding votes for the purposes of decision making.

Procedure

If a formal vote on a proposal is called (signaled simply by sending an email with [VOTE] in the subject line), all subscribers of the development mailing list may express an opinion and vote. They do this by sending an email in reply to the original [VOTE] email, with the following vote and information:

  • +1 (yes, agree): also willing to help bring about the proposed action
  • +0 (yes, agree): not willing or able to help bring about the proposed action
  • -0 (no, disagree): but will not oppose the action’s going forward
  • -1 (no, disagree): opposes the action going forward and must propose an alternative action to address the issue (or a justification for not addressing the issue)

To abstain from the vote, participants simply do not respond to the email. However, it can be more helpful to cast a +0 or -0 than to abstain, since this allows the team to gauge the general feeling of the community if the proposal should be controversial.

Every member of the community, from interested user to the most active developer, has a vote. The project encourages all members to express their opinions in all discussion and all votes. However, only some members have binding votes for the purposes of decision making (see below). It is therefore their responsibility to ensure that the opinions of all community members are considered. While not all members may have a binding vote, a well-justified -1 from a non-committer must be considered by the community, and if appropriate, supported by a binding -1. A -1 can also indicate a veto, depending on the type of vote and who is using it. Someone without a binding vote cannot veto a proposal, so in their case a -1 would simply indicate an objection.

When a [VOTE] receives a -1, it is the responsibility of the community as a whole to address the objection. Such discussion will continue until the objection is either rescinded, overruled (in the case of a non-binding veto) or the proposal itself is altered in order to achieve consensus (possibly by withdrawing it altogether). In the rare circumstance that consensus cannot be achieved, the PMC will decide the forward course of action.

In summary:

  • Those who don’t agree with the proposal and think they have a better idea should vote -1 and defend their counter-proposal.
  • Those who don’t agree but don’t have a better idea should vote -0.
  • Those who agree but will not actively assist in implementing the proposal should vote +0.
  • Those who agree and will actively assist in implementing the proposal should vote +1.

Types of approval

Different actions require different types of approval, ranging from lazy consensus to a majority decision by the PMC. These are summarised in the table below. The section after the table describes which type of approval should be used in common situations.

| Type | Description | Duration |
| --- | --- | --- |
| Lazy consensus | An action with lazy consensus is implicitly allowed, unless a binding -1 vote is received. Depending on the type of action, a vote will then be called. Note that even though a binding -1 is required to prevent the action, all community members are encouraged to cast a -1 vote with supporting argument. Committers are expected to evaluate the argument and, if necessary, support it with a binding -1. | N/A |
| Lazy majority | A lazy majority vote requires more binding +1 votes than binding -1 votes. | 72 hours |
| Consensus approval | Consensus approval requires three binding +1 votes and no binding -1 votes. | 72 hours |
| Unanimous consensus | All of the binding votes that are cast are to be +1 and there can be no binding vetoes (-1). | 120 hours |
| 2/3 majority | Some strategic actions require a 2/3 majority of PMC members; in addition, 2/3 of the binding votes cast must be +1. Such actions typically affect the foundation of the project (e.g. adopting a new codebase to replace an existing product). | 120 hours |
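
The vote-based approval types can be expressed as predicates over the binding votes cast. The sketch below covers lazy majority, consensus approval, and unanimous consensus; it reads "three binding +1 votes" as "at least three" and assumes that unanimity needs at least one vote cast. Both readings, and all function names, are assumptions of this example. The 2/3 majority rule is left out because it also depends on the size of the PMC.

```python
VALID_VOTES = {"+1", "+0", "-0", "-1"}


def _count(votes, value):
    """Count how often `value` occurs, rejecting unknown vote strings."""
    unknown = set(votes) - VALID_VOTES
    if unknown:
        raise ValueError(f"unknown vote values: {sorted(unknown)}")
    return sum(1 for vote in votes if vote == value)


def lazy_majority(binding_votes):
    """More binding +1 votes than binding -1 votes."""
    return _count(binding_votes, "+1") > _count(binding_votes, "-1")


def consensus_approval(binding_votes):
    """At least three binding +1 votes and no binding -1 votes."""
    return _count(binding_votes, "+1") >= 3 and _count(binding_votes, "-1") == 0


def unanimous_consensus(binding_votes):
    """Every binding vote cast is +1 (assumes at least one vote was cast)."""
    return len(binding_votes) > 0 and _count(binding_votes, "+1") == len(binding_votes)
```

Only binding votes are passed in; non-binding votes inform the discussion but do not enter these checks.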

When is a vote required?

Every effort is made to allow the majority of decisions to be taken through lazy consensus. That is, simply stating one’s intentions is assumed to be enough to proceed, unless an objection is raised. However, some activities require a more formal approval process in order to ensure fully transparent decision making.

The table below describes some of the actions that will require a vote. It also identifies which type of vote should be called.

| Action | Description | Approval type |
| --- | --- | --- |
| Release plan | Defines the timetable and actions for a release. A release plan cannot be vetoed (hence lazy majority). | Lazy majority |
| Product release | When a release of one of the project’s products is ready, a vote is required to accept the release as an official release of the project. A release cannot be vetoed (hence lazy majority). | Lazy majority |
| New committer | A new committer has been proposed. | Consensus approval of the PMC |
| New PMC member | A new PMC member has been proposed. | Consensus approval of the community |
| Committer removal | When removal of commit privileges is sought. | Unanimous consensus of the PMC |
| PMC member removal | When removal of PMC membership is sought. | Unanimous consensus of the community |
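
For quick reference, the actions listed here can be combined with the voting durations from the "Types of approval" table in a simple lookup. The data structure and the function name `approval_for` are illustrative only; the values are copied from the two tables.

```python
# Voting windows in hours, taken from the "Types of approval" table
VOTING_HOURS = {
    "lazy majority": 72,
    "consensus approval": 72,
    "unanimous consensus": 120,
}

# Action -> (approval type, voting body); None where the table names no body
REQUIRED_APPROVAL = {
    "release plan": ("lazy majority", None),
    "product release": ("lazy majority", None),
    "new committer": ("consensus approval", "PMC"),
    "new PMC member": ("consensus approval", "community"),
    "committer removal": ("unanimous consensus", "PMC"),
    "PMC member removal": ("unanimous consensus", "community"),
}


def approval_for(action):
    """Return (approval type, voting body, voting window in hours)."""
    approval_type, body = REQUIRED_APPROVAL[action]
    return approval_type, body, VOTING_HOURS[approval_type]


if __name__ == "__main__":
    print(approval_for("new committer"))
```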

Subsections of Development

Roadmap

DataLad started out as a rather monolithic code base that mixed a Python library, a Python API geared towards interactive use, and a command line interface (CLI). The general development trajectory is to disentangle the code, and form a more modular, layered software system that comprises:

  • dedicated applications providing a CLI, a graphical user interface (GUI), and a Python-based command API for interactive use and scripting
  • a collection of topical extension packages
  • utility libraries for a DataLad framework of closely aligned implementations
  • another utility library with generic algorithms and implementations, not considered to be part of the DataLad framework

The schema below depicts the envisioned relationships and dependencies between these components (solid arrows indicate dependencies and dashed ones optional usage).

graph LR;
  subgraph "Non-framework<br>utility libraries"
      salad("datasalad")
  end
  subgraph DataLad framework
      core("datalad-core")
      next("datalad-next")
  end
  subgraph "DataLad<br>applications"
      dlcmd("dlcmd (CLI)")
      gooey("gooey (GUI)")
      py("Python<br>Command API")
  end
  subgraph "(3rd-party)<br>Extension packages"
      extension("datalad-${extension}")
  end
  salad ---> core
  salad ---> next
  salad ---> extension
  core --> next
  core --> extension
  next -.-> extension
  core --> dlcmd
  next -.-> dlcmd
  extension -.-> dlcmd
  core --> gooey
  next -.-> gooey
  extension -.-> gooey
  core --> py
  next -.-> py
  extension -.-> py
  %% node links to websites
  click salad href "https://github.com/datalad/datasalad"
  click core href "https://github.com/datalad/datalad-core"
  click next href "https://github.com/datalad/datalad-next"
  click dlcmd href "/dev/dlcmd/"
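
One way to sanity-check such a layering is to encode the solid arrows as a dependency mapping and confirm that it is free of cycles. The sketch below reads each solid arrow as pointing from a dependency to the component that requires it, which is an interpretation of the schema; the node labels are shortened and the function name is hypothetical. Dashed (optional) edges are left out.

```python
from graphlib import TopologicalSorter

# component -> components it requires (solid arrows only)
REQUIRES = {
    "datasalad": set(),
    "datalad-core": {"datasalad"},
    "datalad-next": {"datasalad", "datalad-core"},
    "datalad-extension": {"datasalad", "datalad-core"},
    "dlcmd": {"datalad-core"},
    "gooey": {"datalad-core"},
    "python-command-api": {"datalad-core"},
}


def build_order(requires):
    """Return components so that each appears after everything it requires.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(requires).static_order())


if __name__ == "__main__":
    print(build_order(REQUIRES))
```

`graphlib` is part of the standard library from Python 3.9 onward.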

Targeted components

Non-framework utility libraries

Such libraries hold implementations developed by the DataLad project and for the DataLad project that are nevertheless so generic that they are not considered to be part of the DataLad framework. This means:

  • no DataLad jargon in messages
  • no dependencies on other DataLad components
  • no use of DataLad facilities (e.g., recognition of DataLad-specific configuration)

A concrete library example is datasalad, which provides tooling to work with subprocesses.

DataLad framework libraries

These libraries provide everything necessary to implement DataLad commands and have them work in a uniform fashion. This includes aspects like configuration management, particular workflows (e.g., credential input and storage), and working with git(-annex) repositories in a particular “DataLad way”.

We distinguish two different libraries: core and next.

The core library provides the essential set of DataLad functionality that is broadly applicable to the widest range of use cases. It aims to have a lean dependency footprint to enable deploying DataLad in a wide range of environments. The current development state is available at https://github.com/datalad/datalad-core.

The next library serves the same purpose and scope as the core library. It is, however, a staging area for making new and improved implementations available before they may migrate into the core library. While core evolves at a comparatively slow pace, next is expected to have a much higher frequency of feature releases. The current development state is available at https://github.com/datalad/datalad-next.

Topical DataLad extension packages can use both libraries to implement their functionality.

User interfaces

The libraries are accompanied by applications that provide concrete user interfaces. These (can) include:

  • command line interface (CLI)
  • graphical user interface (GUI)
  • language-bindings or scripting interfaces

Such interface applications could be lean (only proxying library functionality), or heavily tailored for a specific purpose. There is no assumption of exclusivity. For example, there can be any number of CLI implementations.

In order to cleanly separate the underlying requirements and dependencies, even a “Python command API” is distinguished from the framework libraries (also written in Python). Only the former will define aspects like a uniform logging/messaging behavior.

Topical extension packages

Extension packages extend DataLad with additional functionality. Many extensions are provided by the DataLad project, but they can be implemented completely independently of the project; they require no approval and generally need no coordination with the DataLad project.

Any functionality that is out-of-scope for the DataLad framework libraries can be implemented in an extension package.

Extension development is facilitated by a project template at https://github.com/datalad/datalad-extension-template.

Examples of extension packages are

Continuous integration

Appveyor

We have a paid subscription to the Appveyor CI/CD service. It is administered by mih. Logs for projects can be found at URLs of the pattern https://ci.appveyor.com/project/mih/<project-name>.

GitHub actions

Forgejo actions

The hub is set up to run Forgejo actions. These are, to some degree, compatible with GitHub actions. This means that Forgejo can (attempt to) run GitHub actions, but not the other way round.

We operate runners on the following machines:

dlcmd (CLI)

dlcmd is a command line interface (CLI) for DataLad that aims to provide a modern and convenient approach to using DataLad in a terminal.

DataLad functionality provided via dlcmd is separated into two different categories:

  • tailored, stable commands for a finite set of features
  • auto-generated interfaces for any DataLad command available in an installation

For the second category, dlcmd provides no guarantees regarding API stability or the accessibility of particular functionality in the terminal.

The first category, however, comprises individually tuned and documented commands that are specifically tailored and integrated for their joint use in a terminal. Here dlcmd is not serving as a thin layer between the terminal and a Python implementation of a command, but as a fully featured application with consistent (error) messaging and behavior.

These implementations are individually tested to work via the CLI.

The development of dlcmd is presently conducted at https://hub.datalad.org/datalad/dlcmd