Data Cloud

Export your organization's data to any data warehouse

How it works

Swarmia will send your data as Parquet files to your storage bucket, starting with Amazon S3. This data can then be pulled into your data warehouse of choice, such as Snowflake. The full dataset is refreshed every 8 hours so that reasonably fresh data is available whenever it is pulled.

Data being exported

The data currently exported covers pull requests and deployments. The export contains multiple files:

  • pull_requests.parquet
  • deployments.parquet
  • authors.parquet
  • pull_request_reviews.parquet
  • pull_request_review_requests.parquet
  • pull_request_comments.parquet

The pull request data contains:

  • Author information
  • Contributing teams
  • Investment categories
  • Pull request cycle time components: cycle time, progress time, time in review, and time in merge
  • Timestamps of pull request events: first review time, first and last commit time, and so on
  • Pull request comments, review requests, and reviews, including who reviewed the pull request and when

The deployments data contains:

  • Linked pull requests
  • Deployment environment, application name, and version
  • Links to the deployment being fixed, if any
  • DORA metrics: change lead time, time to deploy, time to recovery, and change failure information
  • Involved teams and authors

You can read more about these metrics in Metrics & Definitions.

Getting started

The setup requires you to have an Amazon S3 storage bucket to which Swarmia can push data.

Set up the integration

  1. Navigate to Data Cloud settings
  2. Insert your S3 bucket credentials
  3. Save the settings. After this, you can test the integration from the "Export" button in the UI.

Considerations for the S3 bucket

Swarmia requires the following permissions on an S3 bucket and folder to be able to upload the exports:

  • s3:GetBucketLocation

  • s3:PutObject

Below is an example AWS IAM policy for the export bucket, where <bucket name> is the name of the bucket you provided in the Data Cloud settings.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<bucket name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}

If you access the data directly from the warehouse, you will most likely need additional permissions. The following IAM policy is the one recommended by Snowflake; it works in both directions, letting Swarmia upload data and Snowflake extract data from the same bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<bucket name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}

Loading data to data warehouse

You can load the data into various data warehouse systems using their native S3 connection methods. The example below shows how to get started with one commonly used system, Snowflake.

Example: Loading data into Snowflake

The following SQL template can get you started with Swarmia Data Cloud and Snowflake. Use it to create the needed Snowflake stage, file format, and data tables so you can start building analytics on top of Swarmia data.

Make sure to set up access from Snowflake to the S3 bucket by following Snowflake's official instructions.

-------------------------------------------------------------------------------------------
-- Create the integration to AWS S3 bucket and the Snowflake stages

-- Create the storage integration for the AWS S3 bucket you have configured with Swarmia Data Cloud.
-- To create the role follow the Snowflake instructions in
-- https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration#step-2-create-the-iam-role-in-aws
-------------------------------------------------------------------------------------------
 
CREATE OR REPLACE STORAGE INTEGRATION swarmia_data_cloud_s3
 TYPE = EXTERNAL_STAGE
 STORAGE_PROVIDER = 'S3'
 ENABLED = TRUE
 STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<id>:role/<role name>'
 STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket name>');

-- Describe our Integration
DESCRIBE INTEGRATION swarmia_data_cloud_s3;
SHOW STORAGE INTEGRATIONS;

-- Create the parquet format for Swarmia data 
CREATE OR REPLACE FILE FORMAT swarmia_parquet
 TYPE = parquet;

-- Create the stage for the Swarmia exported data
CREATE OR REPLACE STAGE swarmia_data_cloud_s3_stage
URL = 's3://<bucket name>/'
STORAGE_INTEGRATION = swarmia_data_cloud_s3 -- created in previous step
FILE_FORMAT = swarmia_parquet;

-- View our Stages
SHOW STAGES;
DESCRIBE STAGE swarmia_data_cloud_s3_stage;
LIST @swarmia_data_cloud_s3_stage;
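
-- Optional sketch (not part of the original template): inspect the raw Parquet rows
-- straight from the stage before creating any tables. The v0/ path matches the
-- COPY commands used further below.
SELECT $1
FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
  (FILE_FORMAT => 'swarmia_parquet')
LIMIT 10;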

---
-- Create the target tables
---
-- Create table for pull requests
CREATE OR REPLACE TABLE swarmia_pull_requests (
    id                              text not null,
    github_number                   int not null,
    author_id                       text not null,
    author_external_identifier      text not null,
    github_repository_full_name     text not null,
    pr_status                       text not null,
    labels                          text not null,
    owner_team_ids                  array,
    owner_team_names                array,
    investment_category_ids         array,
    investment_category_names       array,
    is_draft                        boolean,
    is_excluded                     boolean,
    github_created_at               timestamp,
    github_updated_at               timestamp,
    last_closed_or_merged_at        timestamp,
    first_commit_at                 timestamp,
    last_commit_at                  timestamp,
    opened_at                       timestamp,
    first_review_request_at         timestamp,
    first_reviewed_at               timestamp,
    first_approved_review_at        timestamp,
    last_approved_review_at         timestamp,
    cycle_time_seconds              int,
    progress_time_seconds           int,
    review_time_seconds             int,
    merge_time_seconds              int
);

CREATE OR REPLACE TABLE swarmia_deployments (
    id                             text not null,
    pull_request_ids               array,
    deployment_environment         text not null,
    deployed_at                    timestamp,
    fixes_deployment_id            text,
    fixes_version                  text,
    deployment_version             text,
    deployment_app_name            text,
    time_to_recovery_seconds       int,
    change_lead_time_seconds       int,
    time_to_deploy_seconds         int
);

CREATE OR REPLACE TABLE swarmia_authors (
    id                             text not null,
    name                           text,
    associated_emails              array
);

CREATE OR REPLACE TABLE swarmia_pull_request_comments (
    id                             text not null,
    pull_request_id                text not null,
    source_id                      text not null,
    body_text                      text not null,
    created_at                     timestamp not null,
    author_id                      text not null,
    author_is_bot                  boolean not null
);

CREATE OR REPLACE TABLE swarmia_pull_request_review_requests (
    id                             text not null,
    pull_request_id                text not null,
    requested_by_author_id         text not null,
    requested_from_author_id       text,
    requested_from_team_name       text,
    source_id                      text not null,
    requested_at                   timestamp not null
);

CREATE OR REPLACE TABLE swarmia_pull_request_reviews (
    id                             text not null,
    pull_request_id                text not null,
    author_id                      text not null,
    author_external_identifier     text not null,
    source_id                      text not null,
    state                          text not null,
    submitted_at                   timestamp not null,
    github_repository_full_name    text not null,
    source_repository_full_name    text not null,
    pull_request_review_request_id text
);

-------------------------------------------------------------------------------------------
-- Load data into the Snowflake tables
-------------------------------------------------------------------------------------------

-- Remove any old data the table has to prevent duplicates
TRUNCATE TABLE swarmia_pull_requests;
TRUNCATE TABLE swarmia_deployments;
TRUNCATE TABLE swarmia_authors;
TRUNCATE TABLE swarmia_pull_request_comments;
TRUNCATE TABLE swarmia_pull_request_review_requests;
TRUNCATE TABLE swarmia_pull_request_reviews;


-- Copy the whole dataset into the target tables
COPY INTO swarmia_pull_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
    MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_deployments
FROM @swarmia_data_cloud_s3_stage/v0/deployments.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_authors
FROM @swarmia_data_cloud_s3_stage/v0/authors.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_comments
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_comments.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_review_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_review_requests.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_reviews
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_reviews.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

-------------------------------------------------------------------------------------------
-- Query the data!
-------------------------------------------------------------------------------------------
SELECT COUNT(*) FROM swarmia_pull_requests;
SELECT * FROM swarmia_deployments;

-- Monthly average change lead time for 'Engineers'-team pull requests created since 2023 that have been deployed into production environment.
SELECT
    ROUND(AVG(change_lead_time_seconds))    AS change_lead_time_seconds,
    DATE_TRUNC('month', github_created_at)  AS month
FROM swarmia_deployments d
JOIN swarmia_pull_requests p ON (array_contains(p.id::variant, d.pull_request_ids))
WHERE array_contains('Engineers'::variant, p.owner_team_names)
    AND d.deployment_environment='production'
    AND p.github_created_at >= '2023-01-01'::timestamp
GROUP BY 2
ORDER BY 2 DESC;
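
-------------------------------------------------------------------------------------------
-- Optional: schedule the reload to match the export cadence
-------------------------------------------------------------------------------------------
-- A minimal sketch, not part of Swarmia's official setup: since the export is refreshed
-- every 8 hours, you can reload a table on the same cadence with a Snowflake task.
-- <warehouse name> and the task name are placeholders you would replace with your own.
CREATE OR REPLACE TASK refresh_swarmia_pull_requests
  WAREHOUSE = <warehouse name>
  SCHEDULE = 'USING CRON 0 */8 * * * UTC'
AS
BEGIN
  -- Truncate first to prevent duplicates, then reload the full dataset.
  TRUNCATE TABLE swarmia_pull_requests;
  COPY INTO swarmia_pull_requests
  FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
    MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
END;

-- Tasks are created in a suspended state; resume the task to start the schedule.
ALTER TASK refresh_swarmia_pull_requests RESUME;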

Parquet schemas

All export schemas are defined as follows.

pull_requests

field type required description
id UTF8 Y UUID
github_number INT64 Y GitHub number, like 5532
author_id UTF8 Y Created by, UUID
title UTF8 Y Pull request title
description UTF8 N Pull request description
author_external_identifier UTF8 Y Created by nickname
github_repository_full_name UTF8 Y Repository's name
pr_status UTF8 Y One of the following: OPEN, CLOSED, MERGED
owner_team_ids Array, UTF8 Y Who owns the pull request
owner_team_names Array, UTF8 Y Names of the teams that own the pull request
investment_category_ids Array, UTF8 Y Investment categories the pull request belongs to
investment_category_names Array, UTF8 Y Linked investment categories names
labels Array, UTF8 Y Labels attached to the pull request
is_draft BOOLEAN Y Is the pull request a draft
is_excluded BOOLEAN Y Is the pull request excluded from metrics
github_created_at TIMESTAMP_MICROS Y Created at in GitHub
github_updated_at TIMESTAMP_MICROS Y Updated at in GitHub
last_closed_or_merged_at TIMESTAMP_MICROS N  
first_commit_at TIMESTAMP_MICROS N Time of when first commit took place
last_commit_at TIMESTAMP_MICROS N Time of when last commit took place
opened_at TIMESTAMP_MICROS N When the pull request was opened
first_review_request_at TIMESTAMP_MICROS N When the first review request was sent
first_reviewed_at TIMESTAMP_MICROS N When the pull request was first reviewed
first_approved_review_at TIMESTAMP_MICROS N When the pull request was first approved
cycle_time_started_at TIMESTAMP_MICROS Y When cycle time started
cycle_time_ended_at TIMESTAMP_MICROS N When cycle time ended
additions INT64 N How many lines were added
deletions INT64 N How many lines were deleted
cycle_time_seconds INT64 N Length of the pull request's cycle time
progress_time_seconds INT64 N How long the pull request was in progress
review_time_seconds INT64 N How long the pull request was in review
merge_time_seconds INT64 N How long it took to merge the pull request
linked_swarmia_issue_id UTF8 N Linked swarmia issue
is_revert BOOLEAN N Whether the pull request reverts an existing pull request
changed_files Array, UTF8 Y Array of changed file names

pull_request_review_requests

field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
requested_by_author_id UTF8 Y Who requested the review, UUID
requested_from_author_id UTF8 N Author the review was requested from, UUID
requested_from_team_name UTF8 N Team the review was requested from
source_id UTF8 Y e.g. GitHub's ID
requested_at TIMESTAMP_MICROS Y When the review was requested
 

pull_request_reviews


field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
author_id UTF8 Y Reviewed by, UUID
author_external_identifier UTF8 Y Reviewer's nickname
source_id UTF8 Y e.g. GitHub's ID
state UTF8 Y One of the following: approved, changes_requested, commented
submitted_at TIMESTAMP_MICROS Y When review was submitted
github_repository_full_name UTF8 Y Deprecated, repository name
source_repository_full_name UTF8 Y Repository name
pull_request_review_request_id UTF8 N Review request linked to review, UUID

pull_request_comments

field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
source_id UTF8 Y e.g. GitHub's ID
body_text UTF8 Y Comment content
created_at TIMESTAMP_MICROS Y When the comment was posted
author_id UTF8 Y Who posted the comment, UUID
author_is_bot BOOLEAN Y Is the author a bot

deployments

field type required description
id UTF8 Y UUID
pull_request_ids Array, UTF8 Y Linked pull requests within the deployment
deployment_environment UTF8 Y E.g. production
deployed_at TIMESTAMP_MICROS Y When the deployment was done
fixes_deployment_id UTF8 N If deployment fixes previous deployment
fixes_version UTF8 N Version of the fixed deployment
deployment_version UTF8 Y  
deployment_app_name UTF8 Y  
involved_team_ids Array, UTF8 Y  
involved_team_names Array, UTF8 Y  
involved_author_ids Array, UTF8 Y  
time_to_recovery_seconds INT64 N  
change_lead_time_seconds INT64 N  
time_to_deploy_seconds INT64 N  
is_change_failure BOOLEAN Y  
change_failure_source UTF8 N  

authors

field type required description
id UTF8 Y UUID
name UTF8 N Author's name
associated_emails Array, UTF8 Y All of the author's email addresses
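
To illustrate how the exported tables relate to each other, here is a hedged example query (not from the official template) that joins the review data to the author data loaded above via author_id, counting each reviewer's approved reviews per month.

-- Example only: approved reviews per reviewer per month,
-- joining swarmia_pull_request_reviews to swarmia_authors on author_id.
SELECT
    a.name                              AS reviewer,
    DATE_TRUNC('month', r.submitted_at) AS month,
    COUNT(*)                            AS approved_reviews
FROM swarmia_pull_request_reviews r
JOIN swarmia_authors a ON a.id = r.author_id
WHERE r.state = 'approved'
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC;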