Welcome to dcat’s documentation!

dcat is a data cataloging service for machine learning artisans. This Python package is backed by a web service that stores metadata for machine learning artifacts. It currently supports three artifact types: Feature, Label, and Model.

dcat also provides a binary for the command line shell.

Installation

Install the latest version of the dcat Python package using pip:

pip install dcat --no-cache-dir

Configuration

dcat talks to a web service to store the metadata. Currently, dcat relies on environment variables to determine the web service host and credentials.

Please contact the server administrator for the host name and credentials.

Here are two common ways to set the environment variables.

Shell

Add the following lines to the shell initialization file, for example ~/.bashrc for Bash or ~/.zshrc for Zsh.

export DCAT_HOST=<hostname>
export DCAT_AUTH_USER=<username>
export DCAT_AUTH_PASSWORD=<password>

Restart the terminal session for the changes to take effect.

Python

import os

# Replace the placeholder values with the ones provided by the administrator.
os.environ['DCAT_HOST'] = "<hostname>"
os.environ['DCAT_AUTH_USER'] = "<username>"
os.environ['DCAT_AUTH_PASSWORD'] = "<password>"

Again, please contact the administrator for the valid host name and credentials.
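
Because the configuration lives entirely in environment variables, a quick check before first use can catch a missing value early. The following is a minimal sketch rather than part of dcat: the helper name missing_dcat_vars is hypothetical, and only the three variable names come from the section above.

```python
import os

# Variable names taken from the configuration section above.
REQUIRED_VARS = ("DCAT_HOST", "DCAT_AUTH_USER", "DCAT_AUTH_PASSWORD")


def missing_dcat_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper, not part of the dcat package.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_dcat_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All dcat environment variables are set.")
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise without touching the real environment.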

Artifacts

dcat supports the following artifact types. These artifacts should closely resemble the artifacts a machine learning artisan would work with.

Artifacts are identified by their Data Resource Names (DRNs). Each artifact type has a unique DRN format.
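
To make the DRN layouts concrete, here is a rough sketch that splits a DRN into its parts. It is not part of dcat: parse_drn and the Drn tuple are hypothetical names, the accepted layouts are the ones given under each artifact type in this document (drn:<project-slug> and drn:<project-slug>:<type>:<version-id>), and it assumes that slugs and version ids never contain a colon.

```python
from typing import NamedTuple, Optional

# Artifact type segments that appear in the DRN formats in this document.
KNOWN_KINDS = ("features", "labels", "models")


class Drn(NamedTuple):
    project_slug: str
    kind: Optional[str] = None      # None for a project DRN
    version: Optional[str] = None   # None for a project DRN


def parse_drn(drn: str) -> Drn:
    """Split a DRN string into project slug, artifact type, and version id.

    Hypothetical helper; assumes no segment contains a colon.
    """
    parts = drn.split(":")
    if parts[0] != "drn" or any(not part for part in parts):
        raise ValueError(f"not a DRN: {drn!r}")
    if len(parts) == 2:
        # Project DRN: drn:<project-slug>
        return Drn(parts[1])
    if len(parts) == 4 and parts[2] in KNOWN_KINDS:
        # Artifact DRN: drn:<project-slug>:<type>:<version-id>
        return Drn(parts[1], parts[2], parts[3])
    raise ValueError(f"unrecognised DRN layout: {drn!r}")
```

For example, parse_drn("drn:stocks:features:goog") yields the project slug "stocks", the type "features", and the version id "goog".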

Projects

DRN Format: drn:<project-slug>

A project is a top-level artifact. We assume that a data team works on several projects. A project is mostly an independent entity, with its own ML models, features, and labels.

CLI

The CLI binary supports the following operations:

# list projects
dcat projects list

# create project with name "Stock Prediction"
# and optional slug 'stocks'
dcat projects create "Stock Prediction" --slug 'stocks'

# modify an existing project
# projects are identified by their DRNs
# A project DRN will have a format of drn:<project-slug>
# For the above example, the drn will be `drn:stocks`

dcat projects modify drn:stocks --name "Stock Options Prediction"

# delete project
dcat projects delete drn:stocks

# getting help
dcat projects --help
dcat projects modify --help

Python

API
class dcat.project.Project(verbosity=0)
create(name, slug=None)
delete(drn)
list(**kwargs)
modify(drn, name=None)
Example
from dcat import Project

# list of projects
projects = Project.list()

# create a project
project = Project.create("Stock Prediction", slug="stocks")

# modify project
Project.modify("drn:stocks", name="Stock Option Prediction")

# delete project
Project.delete("drn:stocks")

Features

DRN Format: drn:<project-slug>:features:<version-id>

A feature, or feature matrix, holds the feature vectors used for training and/or testing a machine learning model. A feature is always associated with a single project. A feature is expected to be stored at a given path, and that path should be added to the feature artifact.

CLI

The CLI binary supports the following operations:

# list all features in drn:stocks project
dcat features list drn:stocks

# create a new feature under drn:stocks project
# version is auto generated, but can also be explicitly passed
dcat features create drn:stocks \
        "s3://stocks-bucket/features/goog-features.parquet" \
        --version 'goog'

# modify an existing feature
# features are identified by their DRNs
# A feature DRN will have a format of
# drn:<project-slug>:features:<version-id>
# For the above example, the drn will be `drn:stocks:features:goog`

dcat features modify \
        drn:stocks:features:goog --version "google"

# delete feature
dcat features delete drn:stocks:features:google

# getting help
dcat features --help
dcat features modify --help

Python

API
class dcat.feature.Feature(verbosity=0)
create(project_drn, path, attributes={}, lineages=[], version=None, notes=None)
delete(res_drn)
list(project_drn, attributes={}, lineages=[])
modify(res_drn, attributes={}, lineages=[], remove=False, version=None, notes=None)
Example
from dcat import Feature

# list all features in drn:stocks project
features = Feature.list("drn:stocks")

# create a feature in drn:stocks project
feature = Feature.create(
        "drn:stocks",
        "s3://stocks-bucket/features/goog-features.parquet",
        version="goog"
)

# modify the feature we just created
Feature.modify(
        "drn:stocks:features:goog",
        version="google"
)

# delete the feature we just created
Feature.delete("drn:stocks:features:google")
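
Note that the delete call above targets drn:stocks:features:google rather than drn:stocks:features:goog, which suggests that changing a version also changes the DRN. A minimal sketch of that relationship follows, assuming the version id is always the last colon-separated segment; with_version is a hypothetical helper and not part of dcat.

```python
def with_version(drn: str, new_version: str) -> str:
    """Return the DRN with its version id (last segment) replaced.

    Hypothetical helper; expects drn:<project-slug>:<type>:<version-id>.
    """
    prefix, sep, _old_version = drn.rpartition(":")
    if not sep or prefix.count(":") != 2:
        raise ValueError(
            f"expected drn:<project-slug>:<type>:<version-id>, got {drn!r}"
        )
    return f"{prefix}:{new_version}"


# DRN to use after the modify call in the example above.
renamed = with_version("drn:stocks:features:goog", "google")
```

Computing the new DRN this way avoids retyping it by hand when a later call needs to refer to the renamed artifact.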

Labels

DRN Format: drn:<project-slug>:labels:<version-id>

A label is a data set that maps instance ids to target outcomes. A label is always associated with a single project. A label is expected to be stored at a given path, and that path should be added to the label artifact.

CLI

The CLI binary supports the following operations:

# list all labels in drn:stocks project
dcat labels list drn:stocks

# create a new label under drn:stocks project
# version is auto generated, but can also be explicitly passed
dcat labels create drn:stocks \
        "s3://stocks-bucket/labels/buy-goog-labels.parquet" \
        --version 'buy-goog'

# modify an existing label
# labels are identified by their DRNs
# A label DRN will have a format of
# drn:<project-slug>:labels:<version-id>
# For the above example, the drn will be
# `drn:stocks:labels:buy-goog`

dcat labels modify \
        drn:stocks:labels:buy-goog --version "sell-goog"

# delete label
dcat labels delete drn:stocks:labels:sell-goog

# getting help
dcat labels --help
dcat labels modify --help

Python

API
class dcat.label.Label(verbosity=0)
create(project_drn, path, attributes={}, lineages=[], version=None, notes=None)
delete(res_drn)
list(project_drn, attributes={}, lineages=[])
modify(res_drn, attributes={}, lineages=[], remove=False, version=None, notes=None)
Example
from dcat import Label

# list all labels in drn:stocks project
labels = Label.list("drn:stocks")

# create a label in drn:stocks project
label = Label.create(
        "drn:stocks",
        "s3://stocks-bucket/labels/buy-goog.parquet",
        version="buy-goog"
)

# modify the label we just created
Label.modify(
        "drn:stocks:labels:buy-goog",
        version="sell-goog"
)

# delete the label we just created
Label.delete("drn:stocks:labels:sell-goog")

Models

DRN Format: drn:<project-slug>:models:<version-id>

A model, or machine learning model, is built using labels and features. A model is always associated with a single project. A model is expected to be stored at a given path, and that path should be added to the model artifact.

CLI

The CLI binary supports the following operations:

# list all models in drn:stocks project
dcat models list drn:stocks

# create a new model under drn:stocks project
# version is auto generated, but can also be explicitly passed
dcat models create drn:stocks \
        "s3://stocks-bucket/models/awesome-model.pkl" \
        --version 'awesome-model'

# modify an existing model
# models are identified by their DRNs
# A model DRN will have a format of
# drn:<project-slug>:models:<version-id>
# For the above example, the drn will be
# `drn:stocks:models:awesome-model`

dcat models modify \
        drn:stocks:models:awesome-model --version "great-model"

# delete model
dcat models delete drn:stocks:models:great-model

# getting help
dcat models --help
dcat models modify --help

Python

API
class dcat.model.Model(verbosity=0)
create(project_drn, path, attributes={}, lineages=[], version=None, notes=None)
delete(res_drn)
list(project_drn, attributes={}, lineages=[])
modify(res_drn, attributes={}, lineages=[], remove=False, version=None, notes=None)
Example
from dcat import Model

# list all models in drn:stocks project
models = Model.list("drn:stocks")

# create a model in drn:stocks project
model = Model.create(
        "drn:stocks",
        "s3://stocks-bucket/models/awesome-model.pkl",
        version="awesome-model"
)

# modify the model we just created
Model.modify(
        "drn:stocks:models:awesome-model",
        version="great-model"
)

# delete the model we just created
Model.delete("drn:stocks:models:great-model")