Data Export
Introduction
Scarf provides a robust platform for tracking package downloads and pixel views. Exporting this data is essential for analytics, reporting, and integrating with other tools. This guide explains how to export data from Scarf, what data is exported, and how to make use of the available integrations.
Prerequisites
Exporting data from Scarf is only available on a Scarf Paid Plan.
How to Export Event Data
Scarf Dashboard
To export data out of Scarf, go to the main dashboard and click "Export packages data". This exports all data for the default period, the past month.
The data you can export from Scarf includes all events (defined as package downloads and pixel views) from every user who has interacted with your Scarf-enabled artifacts (packages and pixels). Upon clicking "Export packages data", this data downloads as a .csv file.
Scarf API
You can also export this data using the Scarf API.
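For example, a request following the same pattern as the other endpoints in this guide might look like the following (the exact path and query parameters here are assumptions; consult the Scarf API reference for the authoritative endpoint):
curl -o events.csv \
-H "Authorization: Bearer {token}" \
-H "Content-Type: text/csv" \
"https://api.scarf.sh/v2/packages/{owner}/events?start_date={start_date}&end_date={end_date}"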
What Data is Exported
Export Fields
The event data export includes the following data fields:
name | type | description |
---|---|---|
id | text | Uniquely identifies the event (pixel view or package download) that occurred. |
type | text | Categorizes the type of event that occurred (e.g. pixel-fetch, manifest-fetch, binary-download, etc.). |
package | text | For Scarf package downloads, the package that was downloaded. |
pixel | text | For Scarf pixel views, the pixel that was fetched. |
version | text | For Scarf package downloads, the version of the package that was downloaded. |
time | timestamp | The time, in UTC, at which the event occurred. |
referer | text | For Scarf pixel views, the page that was viewed. |
user_agent | text | The User-Agent header, which provides information about the method of installation, often including operating system, device, browser, architecture, and client. |
variables | text | Any custom-specified variables you use Scarf to track in file package downloads. |
origin_id | text | Uniquely identifies the user (through a specific device) associated with the event. |
origin_latitude | numeric | The latitude of the location Scarf was able to identify for the event. |
origin_longitude | numeric | The longitude of the location Scarf was able to identify for the event. |
origin_country | text | The country of the location Scarf was able to identify for the event. |
origin_city | text | The city of the location Scarf was able to identify for the event. |
origin_state | text | The state of the location Scarf was able to identify for the event. |
origin_postal | text | The postal code (ZIP code, in the US) of the location Scarf was able to identify for the event. |
origin_connection_type | text | The type of IP address Scarf was able to identify (e.g. business, isp, hosting, etc.). |
origin_company | text | If Scarf was able to associate the event with a known business entity, that entity is listed here. |
origin_domain | text | If Scarf was able to associate the event with a known business entity, that entity's web domain is listed here. |
dnt | boolean | If the user includes a DNT request in their headers, it is logged here and they will not be tracked. |
confidence | numeric | The probability that the identification data is correct. |
endpoint_id | text | Uniquely identifies the public-facing device associated with the event. Unlike origin_id, it is not sensitive to changes in device information such as client or user agent. |
mtc_quota_exceeded | boolean | A value of true indicates the company information in this row was scrubbed because the MTC limit was exceeded. |
How to Export Aggregate Data
The documentation for exporting aggregates can be found in Export aggregates. Here's an example curl request to download aggregate data; the output is newline-delimited JSON (NDJSON).
curl -o {filename}.jsonl \
-H "Authorization: Bearer {token}" \
-H "Content-Type: application/x-ndjson" \
"https://api.scarf.sh/v2/packages/{owner}/aggregates?start_date={start_date}&end_date={end_date}&breakdown=by-company"
How to Export Company Data
The documentation for exporting company data rolled up at a daily interval can be found in Export Company Data. Here's an example curl request to download company rollup data.
curl -o company-rollup.csv \
-H "Authorization: Bearer {token}" \
-H "Content-Type: text/csv" \
"https://api.scarf.sh/v2/packages/{owner}/company-rollup"
The company data export includes the following data fields.
name | type | description |
---|---|---|
company_name | text | Name of the company. |
company_domain | text | Domain of the company, e.g. scarf.sh. |
funnel_stage | text | Stage of the company's journey in using your software. |
total_events | numeric | Count of total events. |
unique_sources | numeric | Number of distinct traffic sources that make up the total event count from this organization. |
first_seen | text | Date when the first event occurred. |
last_seen | text | Date when the last event occurred. |
company_linkedin_url | text | The company's LinkedIn URL. |
company_industry | text | The company's industry, e.g. Tech, Government, etc. |
company_size | text | The company's approximate employee count. |
company_country | text | The company's country. |
company_state | text | The company's state. |
interest_start_date | text (yyyy-mm-dd) | Date when the company entered the interest funnel_stage. |
investigation_start_date | text (yyyy-mm-dd) | Date when the company entered the investigation funnel_stage. |
experimentation_start_date | text (yyyy-mm-dd) | Date when the company entered the experimentation funnel_stage. |
ongoing_usage_start_date | text (yyyy-mm-dd) | Date when the company entered the ongoing usage funnel_stage. |
inactive_start_date | text (yyyy-mm-dd) | Date when the company entered the inactive funnel_stage. |
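If you want to pull this CSV straight into PostgreSQL (for ad-hoc analysis, or alongside the exporter described under Integrations below), a minimal sketch using psql's \copy might look like this, assuming you have already created a company_rollup table whose columns match the fields above:
# Illustrative: load the rollup CSV into a pre-created table matching the fields above
psql "$PSQL_CONN_STRING" -c "\copy company_rollup FROM 'company-rollup.csv' WITH (FORMAT csv, HEADER true)"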
How to Export Company Events
The documentation for exporting company events can be found in Export Company Events. Here's an example curl request to download company events data.
curl -o company-events.csv \
-H "Authorization: Bearer {token}" \
-H "Content-Type: text/csv" \
"https://api.scarf.sh/v2/companies/{owner}/{domain}/events?start_date={start_date}&end_date={end_date}"
Integrations
Scarf to PostgreSQL
GitHub: https://github.com/scarf-sh/scarf-postgres-exporter
Overview
The Scarf to PostgreSQL Exporter is a script that pulls down your raw Scarf data and loads it into a PostgreSQL database. It is intended to be run as a daily batch job and provides an automated way to backfill and update your PostgreSQL database with Scarf's enhanced data.
Prerequisites
- psql must be installed and available in your environment (or use the Docker container, which includes everything you need).
- A Scarf account.
- Your Scarf API token. You can find your API token on your user settings page.
Settings
The following environment variables are required:
- SCARF_API_TOKEN: Your Scarf API access token.
- SCARF_ENTITY_NAME: Your Scarf username or the name of your organization.
- PSQL_CONN_STRING: The PostgreSQL connection string.
Optional:
- BACKFILL_DAYS: Number of days of data to backfill. Defaults to 31 if not set.
Getting Started
For more details, you can visit the GitHub repository.
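Here is an illustrative sketch of a daily run (the Docker image name and invocation are assumptions; the repository's README is authoritative):
# Required settings (see above)
export SCARF_API_TOKEN="your-api-token"
export SCARF_ENTITY_NAME="your-org"
export PSQL_CONN_STRING="postgresql://user:password@localhost:5432/scarf"
# Optional; defaults to 31
export BACKFILL_DAYS=31

# Run as a daily batch job, e.g. from cron (image name is illustrative)
docker run --rm \
-e SCARF_API_TOKEN -e SCARF_ENTITY_NAME -e PSQL_CONN_STRING -e BACKFILL_DAYS \
scarfsh/scarf-postgres-exporter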
Future Integrations
More integrations are in development. If there are particular data sources you'd like Scarf to integrate with, we'd love to hear from you.
Daily Scheduled Exports
Schedule a daily export via the API endpoint https://api.scarf.sh/v2/exports/{owner}/schedule-export. This endpoint can export both raw events and company-rollup aggregated data to the bucket indicated in the s3_uri_destination field. The S3 URI you submit is treated as the bucket name; do not specify an object key. The service generates the object key in the format <events|company-rollups>-scarf-export-<start date>-<end date>.csv.
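Here's a sketch of what scheduling a daily export might look like (the use of POST, and any fields beyond the documented s3_uri_destination and arn_role, are assumptions; see the API reference for the exact request schema):
curl -X POST \
-H "Authorization: Bearer {token}" \
-H "Content-Type: application/json" \
-d '{"s3_uri_destination": "s3://{bucket-name}", "arn_role": "arn:aws:iam::{account-id}:role/{role-name}"}' \
"https://api.scarf.sh/v2/exports/{owner}/schedule-export"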
Setting up your S3 account
Create a policy that allows Scarf to assume a role; here's an example of such a policy. Note that this example is highly permissive. If you want to customize the role, refer to the relevant AWS documentation.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:AssumeRole",
"sts:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::<bucket-name>/<folder-a>/<subfolder-a>/*"
]
}
]
}
Next, create a role with this policy attached. Its ARN will take the form arn:aws:iam::<account-id>:role/<role-name>, where <account-id> is your own AWS account ID (not Scarf's, which is 032190491485).
After creating the role, go to its "Trust relationships" tab and add the following trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::032190491485:user/production-v2-scarf-server"
},
"Action": "sts:AssumeRole"
}
]
}
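If you prefer the AWS CLI to the console, the role can be created with the trust policy in one step (the role name is illustrative; save the trust policy above as trust-policy.json first, then attach the permissions policy separately):
aws iam create-role \
--role-name scarf-export-role \
--assume-role-policy-document file://trust-policy.json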
The role's ARN is what you will need in the arn_role API field.
This is not exhaustive documentation of how to set up a shared S3 bucket. Please refer to the AWS documentation for more information.