Data Dictionary 101

Understanding provenance of a dataset in Seamless Horizons

Written by William Gibson
Updated over a year ago

The data dictionary is a one-stop source of information about the datasets that make up Seamless Horizons and any other external datasets that might be useful for investigators. We store metadata about where a dataset came from or where it can be accessed, when it was collected and added to Seamless Horizons, and other information about the source that helps analysts and investigators use the dataset most effectively.

How to Access

To access the data dictionary, navigate to it from the knowledge wiki, either from the home page or by clicking the link on the navbar. This takes you to the data dictionary landing page, where you can filter and search all of the datasets that you have access to.

Alternatively, if you want to view information about an individual dataset you can navigate directly to the data dictionary page from the search page of the application. This is particularly helpful if you get a result and are interested in learning more about the source of the information while you are searching. To do this, simply click the knowledge wiki icon on the right side of either the result panel or the data preview panel to navigate directly to that dataset entry.

Navigating the Data Dictionary

The data dictionary page has several features that help you explore the data in the system and find useful results. At the top of the landing page is a single search bar. It lets you search not only by dataset name but also by content within the dataset description. For example, if I can't remember the UN Panel of Expert dataset title and type "UN investigation" in the search bar, it returns the UN Panel of Expert Report dataset because of content in the dataset description.

There are four filters available for the users to explore datasets:

  • Type - This is the type of data that we collect from the dataset. This dropdown includes values like Corporate Registry, Financial Data, Trade Data, etc. so that you can filter down to see all of the datasets in a particular grouping.

  • Country - This shows dataset coverage by country, so you can view all datasets related to Russia, for example.

  • Region - If you are interested in only our coverage of Europe you can also use this to filter to the region instead of selecting multiple countries.

  • Language - Allows users to filter down to only datasets of a target language, so if they are only doing Chinese language research they can see our full coverage.

The toggle on the right side of the search bar, labeled 'Only show datasets with searchable documents', limits results to datasets that we have collected and stored in Seamless Horizons and filters out external dataset links.
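As a rough illustration of how the search bar, the four filters, and the toggle combine, here is a minimal Python sketch. It is not how Seamless Horizons is implemented: the `Dataset` fields, the `search_datasets` function, the all-terms matching rule, and every catalog entry (including the "Reports" type and the summary text) are assumptions made up for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical fields that mirror the filters described above.
    name: str
    summary: str
    data_type: str
    countries: tuple = ()
    regions: tuple = ()
    languages: tuple = ()
    has_searchable_documents: bool = True  # False for external dataset links


def tokens(text):
    """Lowercase word tokens, used for a simple all-terms match."""
    return set(re.findall(r"\w+", text.lower()))


def search_datasets(datasets, query="", data_type=None, country=None,
                    region=None, language=None, searchable_only=False):
    """Return datasets matching the query and every filter that is set."""
    wanted = tokens(query)
    results = []
    for ds in datasets:
        # The search bar looks at both the name and the description.
        if not wanted <= tokens(ds.name + " " + ds.summary):
            continue
        if data_type is not None and ds.data_type != data_type:
            continue
        if country is not None and country not in ds.countries:
            continue
        if region is not None and region not in ds.regions:
            continue
        if language is not None and language not in ds.languages:
            continue
        if searchable_only and not ds.has_searchable_documents:
            continue
        results.append(ds)
    return results


# Made-up catalog entries for illustration only.
catalog = [
    Dataset("UN Panel of Expert Report",
            "Findings from a UN sanctions investigation.",
            "Reports", ("Global",), (), ("English",)),
    Dataset("Example Corporate Registry",
            "Placeholder registry of companies.",
            "Corporate Registry", ("Russia",), ("Europe",), ("Russian",)),
    Dataset("Example External Trade Portal",
            "Placeholder link to an external trade site.",
            "Trade Data", ("Russia",), ("Europe",), ("Russian",),
            has_searchable_documents=False),
]

print([d.name for d in search_datasets(catalog, "UN investigation")])
print([d.name for d in search_datasets(catalog, country="Russia",
                                       searchable_only=True)])
```

In this sketch the query "UN investigation" matches the first entry only because "investigation" appears in its summary, and turning on `searchable_only` drops the external link from the Russia results.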

To view a data dictionary entry, simply click its row under the search and filter panel.

Understanding Data Dictionary Entries

From the data dictionary entry you are able to view topline metadata about each dataset, including:

  • Records count: Total number of records for each dataset

  • Countries: All countries covered in a dataset, including 'Global' datasets

  • Date of Snapshot: This is the date that the data was collected by C4ADS

  • Type: The type of data that the dataset represents

  • Regions: This is helpful for filtering datasets down on the search page

  • Languages: The language of the source data, although C4ADS may have enriched and translated or transliterated portions of the original data

  • Sensitivity: This is an internal flag that C4ADS uses to determine procedures for granting access to external partners using Seamless Horizons.

In the body of the data dictionary we have written out some content to help you understand what a dataset is showing and why we believe that it was high-signal enough to collect and store in Seamless Horizons. The fields include:

  • Summary: A basic summary of the dataset that is also searchable from the data dictionary page.

  • Credibility: Notes on whether or not the source information is credible. For example, an official corporate registry hosted on a government website is far more credible than a leaked dataset acquired from an unknown source. We strive to take good notes here to notify users so that they can make their own analytical inferences about trustworthiness of a dataset.

  • Coverage: This includes some notes on what information is covered in the collected data and whether or not it reflects the full scope of available data.

  • Search Tips and Tricks: If there are any notes that help users understand how best to extract meaning from a dataset, we store them in this section of the dataset entry.

  • Sample Records: Shows a single record of a dataset and what is included.
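For readers who think in terms of records, the topline metadata and body fields listed above can be pictured as a single structure. The Python sketch below is only a mental model: the class name, the field names, the `topline` helper, and all of the values are hypothetical and do not reflect the actual Seamless Horizons schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DataDictionaryEntry:
    # Topline metadata (hypothetical names mirroring the labels above)
    records_count: int
    countries: tuple
    date_of_snapshot: date
    type: str
    regions: tuple
    languages: tuple
    sensitivity: str
    # Body fields written up for each dataset
    summary: str = ""
    credibility: str = ""
    coverage: str = ""
    search_tips: str = ""
    sample_record: Optional[dict] = None


TOPLINE_FIELDS = ("records_count", "countries", "date_of_snapshot", "type",
                  "regions", "languages", "sensitivity")


def topline(entry):
    """Return only the topline metadata of an entry as a dict."""
    return {name: getattr(entry, name) for name in TOPLINE_FIELDS}


# Placeholder values for illustration only.
entry = DataDictionaryEntry(
    records_count=1000,
    countries=("Global",),
    date_of_snapshot=date(2020, 1, 1),
    type="Corporate Registry",
    regions=(),
    languages=("English",),
    sensitivity="placeholder",
    summary="Placeholder summary text.",
)

print(topline(entry))
```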
