# Finding leads in big data lakes: State of the art of the FollowTheMoney toolkit
This pad: **https://l.idio.is/ftm-scicar24**
Tutorials: https://pad.investigativedata.org/8Y-njO7qROeo0U7OOpplvQ#
This session: https://24.scicar.de/scicar24/talk/ZS3C3K/
## contact
Simon Wörpel, simon@investigativedata.org
investigativedata.io
## what's out there (aka "showcase")
- opensecuritydata.eu
- farmsubsidy.org
- spendengerichte.correctiv.org
- followthegrant.org
- opensanctions.org
- aleph.occrp.org
- aleph.investigativedata.org
- investigraph.eu
## Investigative Data Journalism / Computer Assisted Reporting
**Things of interest: Persons, Companies, Events, and how all of these are connected**
### Challenges
- Different names for the same person across multiple datasets
- Sometimes we know exact dates for events, sometimes only the year
- For some _Things_ we have more granular information than for others
- For some _Things_ we have different information over time
### Research questions
- How similar is Person 1 to Person 2?
- Find the same candidates across multiple lists
- Which of these candidates show up in my _Documents_?
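
A minimal sketch of the first question, using only the Python standard library. The helper names `normalize_name` and `name_similarity` are hypothetical and not part of any FollowTheMoney package; matching in the actual toolkit is considerably more sophisticated.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.split())


def name_similarity(left: str, right: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (identical)."""
    matcher = SequenceMatcher(None, normalize_name(left), normalize_name(right))
    return matcher.ratio()


if __name__ == "__main__":
    # Two spellings of the same person vs. an unrelated name
    print(name_similarity("Simon Wörpel", "Simon Woerpel"))
    print(name_similarity("Simon Wörpel", "Jane Doe"))
```

A plain string ratio like this is only a starting point: it ignores name order, transliteration and additional evidence such as birth dates.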
### What we need
- Standardized concepts to model real-world _Things_ in a computer program
- Reproducible data generation
- Library-like dataset and catalog archiving and sharing
## FollowTheMoney
:mag_right: An investigative research method
:globe_with_meridians: A data standard (ontology) to describe investigative subjects such as Persons, Companies, and their relationships
:slot_machine: A set of tools that implement :mag_right: and :globe_with_meridians:
## 1. The method
Misuse of power, influence on politics, lobbying, corruption, and most of the bad things in the world only happen because of :moneybag::moneybag::moneybag: (and :older_man:)
To investigate such things, "_just follow the money_"
To _just follow the money_, researchers need a lot of data about _Things_ and their connections.
## 2. The ontology
### Things
:smirk: Let's talk about me
- Schema: `Person`
- Properties:
  - name: `Simon Wörpel`
  - birth date: `1989-04`
:factory: my organization
- Schema: `Company`
- Label: `IDIO Daten Import Export GmbH`
- Properties:
  - Website: `investigativedata.io`
### Intervals
:smirk: -> :factory:
- Schema: `Directorship`
- Properties:
  - director: :smirk:
  - organization: :factory:
  - start date: `2023-04-26`
:factory: -> :moneybag: -> :smirk:
- Schema: `Payment`
- Properties:
  - payer: :factory:
  - beneficiary: :smirk:
  - amount: `2000`
## 3. The tools
### machine-readable data representation: JSON
```json
{
  "id": "simon",
  "schema": "Person",
  "properties": {
    "name": ["Simon Wörpel"],
    "birthDate": ["1989-04"],
    "firstName": ["Simon"],
    "lastName": ["Wörpel"]
  }
}
{
  "id": "de-361014839",
  "schema": "Company",
  "properties": {
    "name": ["IDIO Daten Import Export GmbH"],
    "incorporationDate": ["2023-04-26"],
    "website": ["https://investigativedata.io"],
    "legalForm": ["Gesellschaft mit beschränkter Haftung"]
  }
}
{
  "id": "de-361014839-directorship-simon",
  "schema": "Directorship",
  "properties": {
    "director": ["simon"],
    "organization": ["de-361014839"],
    "startDate": ["2023-04-26"]
  }
}
```
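
In files and streams such as `entities.ftm.json`, entities are usually serialized as one JSON object per line rather than pretty-printed as above. A minimal sketch of reading and writing that line-based format with the standard library; `write_entities` and `read_entities` are hypothetical helper names, not functions of the `followthemoney` package.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_entities(fh: TextIO, entities: Iterable[dict]) -> None:
    """Write one compact JSON object per line."""
    for entity in entities:
        fh.write(json.dumps(entity, ensure_ascii=False, sort_keys=True) + "\n")


def read_entities(fh: TextIO) -> Iterator[dict]:
    """Yield entities lazily, so large files never need to fit in memory."""
    for line in fh:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


if __name__ == "__main__":
    person = {
        "id": "simon",
        "schema": "Person",
        "properties": {"name": ["Simon Wörpel"], "birthDate": ["1989-04"]},
    }
    buffer = io.StringIO()  # stands in for a real file
    write_entities(buffer, [person])
    buffer.seek(0)
    for entity in read_entities(buffer):
        print(entity["schema"], entity["properties"]["name"])
```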
### data features
- All properties are _multi valued_
- All properties are _strings_ (but we know about types)
- Dates can be _fuzzy_ (`2024`, `2024-04`, `2024-04-04` are all valid formats)
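
To illustrate the fuzzy dates, here is a small sketch that validates the three precisions and checks whether two dates could describe the same point in time. `is_fuzzy_date` and `dates_compatible` are hypothetical helpers written for this pad, not the `followthemoney` implementation.

```python
import re
from datetime import date

# Year, optionally followed by month, optionally followed by day
FUZZY_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_fuzzy_date(value: str) -> bool:
    """Accept `YYYY`, `YYYY-MM` and `YYYY-MM-DD` if they name a real date."""
    match = FUZZY_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing parts default to 1 so the calendar check still works
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


def dates_compatible(left: str, right: str) -> bool:
    """True if the less precise date is a prefix of the more precise one."""
    if not (is_fuzzy_date(left) and is_fuzzy_date(right)):
        return False
    shorter, longer = sorted((left, right), key=len)
    return longer.startswith(shorter)


if __name__ == "__main__":
    print(dates_compatible("1989-04", "1989-04-12"))
    print(dates_compatible("1989-04", "1989-05-01"))
```

Keeping dates as strings is what makes the prefix comparison possible: a converted `date` object would force a precision the source never had.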
### the base library
`followthemoney` [Documentation](https://followthemoney.tech)
- Defines the model
- Transforms arbitrary input data into the ontology
- Base logic for comparison and _merging_
```bash
cat data.csv | ftm map-csv mapping.yml > entities.ftm.json
```
- The model (can be customized): https://followthemoney.tech/explorer/
- Mapping intro: https://docs.aleph.occrp.org/developers/how-to/data/import-tabular-data/#mapping-tabular-data
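
Conceptually, a mapping turns each table row into one or more entities with a stable ID. The sketch below imitates that idea in plain Python; it is not the `ftm map-csv` implementation. The sample rows are made up, and the column names, the `make_id` helper and the ID scheme (a SHA-1 over a prefix and the key column) are assumptions for illustration.

```python
import csv
import hashlib
import io
import json

# Made-up sample data standing in for data.csv
CSV_DATA = """id,name,country
1,Example Aid e.V.,de
2,Sample Relief Ltd,gb
"""


def make_id(*parts: str) -> str:
    """Derive a stable ID from key values (hypothetical scheme)."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def map_row(row: dict) -> dict:
    """Turn one CSV row into an entity dict with multi-valued properties."""
    return {
        "id": make_id("demo", row["id"]),
        "schema": "Organization",
        "properties": {
            "name": [row["name"]],
            "country": [row["country"]],
        },
    }


def map_csv(text: str) -> list[dict]:
    return [map_row(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    for entity in map_csv(CSV_DATA):
        print(json.dumps(entity, ensure_ascii=False))
```

Because the ID depends only on the key column, re-running the mapping on updated data yields the same IDs, which keeps the data generation reproducible.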
### wait, documents?
Any document is just another _Thing_, really!
```json
{
  "id": "22596363b3de40b06f981fb85d82312e8c0ed511",
  "schema": "PlainText",
  "properties": {
    "fileName": ["hello.txt"],
    "contentHash": ["22596363b3de40b06f981fb85d82312e8c0ed511"],
    "bodyText": ["hello world"]
  }
}
```
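
A sketch of how such a document entity could be built from a file. It assumes that the content hash is a SHA-1 hex digest (the 40-character value above suggests this) and that the hash doubles as the entity ID, as in the example. `document_entity` is a hypothetical helper, not part of Aleph or `followthemoney`.

```python
import hashlib
from pathlib import PurePath


def document_entity(file_name: str, content: bytes) -> dict:
    """Describe a plain-text file as an entity keyed by its content hash."""
    content_hash = hashlib.sha1(content).hexdigest()  # assumption: SHA-1
    return {
        "id": content_hash,
        "schema": "PlainText",
        "properties": {
            "fileName": [PurePath(file_name).name],
            "contentHash": [content_hash],
            "bodyText": [content.decode("utf-8")],
        },
    }


if __name__ == "__main__":
    entity = document_entity("docs/hello.txt", b"hello world\n")
    print(entity["id"], entity["properties"]["fileName"])
```

Identifying a document by its content means the same file uploaded twice resolves to the same entity.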
### Aleph
![](https://s3.investigativedata.org/hedgedoc-minyos/uploads/7cd641b3-8a00-49ba-abcf-636cb3e32277.png)
Aleph is open-source software that can store huge datasets and make them searchable in a collaborative but secure way. It is mostly known for the [public instance](https://aleph.occrp.org) offered by the [Organized Crime and Corruption Reporting Project (OCCRP)](https://occrp.org), but any organization or research team can run its own exclusive, independent instance.
- https://aleph.occrp.org
- https://aleph.investigativedata.org
- A very sophisticated "FtM Viewer" (and ingestor)
- Scalable research platform and data archive
- Organize data in collections
- Upload and analyze documents
- Cross-matching between datasets
## Beyond Aleph
### `nomenklatura`
Nomenklatura de-duplicates and integrates different Follow the Money entities. It serves to clean up messy data and to find links between different datasets.
- Developed by [OpenSanctions](https://opensanctions.org)
- Github: https://github.com/opensanctions/nomenklatura
![](https://s3.investigativedata.org/hedgedoc-minyos/uploads/862fcd60-0db4-4506-b140-f8e578b89ff0.png)
- Next generation deduplication
- Introducing a new way to store information:
### Statements
`nomenklatura` (the base data layer for [OpenSanctions](https://opensanctions.org)) stores the _Entities_ in a statement-based model as a _list of observations_.
The OpenSanctions database is designed to meet the following design objectives:
- Be able to dynamically merge and un-merge entities from a broad range of data sources.
- Be able to identify the origin of each piece of information about a sanctions target or other entity.
- Track entities and their properties as they change over time.
In order to meet these goals, the system uses a statement-based database design. To illustrate this, think of a claim like this one: the US sanctions list, as of the most recent update, claims that entity `ofac-12345` has the property `name` set to the value `John Doe`.
[Learn more](https://www.opensanctions.org/docs/statements/)
#### example
:smirk:
| entity_id | dataset | schema | property | value | seen | source_url |
|----|---------|--------|----------|-------|-----------|------------|
| 1 | scicar24 | Person | firstName | Simon | 2024-09-28 | https://wrpl.de |
| 1 | scicar24 | Person | position | Data journalist | 2015-03-21 | https://correctiv.org/team/simon-woerpel/ |
| 1 | scicar24 | Person | position | Managing director | 2023-05-07 | https://investigativedata.io/contact/ |
Putin's names: https://www.opensanctions.org/statements/Q7747/?prop=name
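
To make the statement idea concrete, the sketch below assembles entities from a list of statements shaped like the table rows above. `assemble_entities` is a hypothetical function for this pad and does not reflect the `nomenklatura` API; the optional `datasets` filter only hints at how sources can be included or excluded when an entity is built.

```python
from typing import Iterable, Optional

STATEMENTS = [
    {"entity_id": "1", "dataset": "scicar24", "schema": "Person",
     "property": "firstName", "value": "Simon", "seen": "2024-09-28"},
    {"entity_id": "1", "dataset": "scicar24", "schema": "Person",
     "property": "position", "value": "Data journalist", "seen": "2015-03-21"},
    {"entity_id": "1", "dataset": "scicar24", "schema": "Person",
     "property": "position", "value": "Managing director", "seen": "2023-05-07"},
]


def assemble_entities(
    statements: Iterable[dict], datasets: Optional[set] = None
) -> dict:
    """Group statements by entity ID into multi-valued entity dicts."""
    entities: dict = {}
    for stmt in statements:
        if datasets is not None and stmt["dataset"] not in datasets:
            continue  # ignore observations from sources we did not ask for
        entity = entities.setdefault(
            stmt["entity_id"],
            {"id": stmt["entity_id"], "schema": stmt["schema"], "properties": {}},
        )
        values = entity["properties"].setdefault(stmt["property"], [])
        if stmt["value"] not in values:
            values.append(stmt["value"])
    return entities


if __name__ == "__main__":
    print(assemble_entities(STATEMENTS)["1"]["properties"])
```

Since every value keeps its own dataset and date, an entity can be rebuilt at any time from a different selection of statements, which is what makes merging reversible.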
### `ftmq`
Builds on top of `nomenklatura` and offers a simple storage and querying interface for `followthemoney` entities stored in the statement-based format.
https://github.com/investigativedata/ftmq
### `ftmq-api`
Exposes `followthemoney` data via a Python (FastAPI) web API:
https://github.com/investigativedata/ftmstore-fastapi
### `ftm-joy-ui`
[React](https://react.dev) components based on [Joy UI](https://mui.com/joy-ui/getting-started/) for rendering _entities_.
### `investigraph`
Builds on top of the stack above and offers an easy-to-use interface to map, transform, and load `followthemoney` data.
https://investigraph.dev
## Putting it all together: Building the lake
Both _Aleph_ and _nomenklatura_ specify a metadata model for a _Dataset_, which is a collection of `followthemoney` entities. This allows us to define metadata and resource links for a _Dataset_ and share it with other organizations.
Example dataset metadata:
https://data.ftm.store/gdho/index.json
```json
{
  "name": "gdho",
  "title": "Global Database of Humanitarian Organisations",
  "summary": "GDHO is a global compendium of organisations that provide aid in humanitarian\ncrises. The database includes basic organisational and operational\ninformation on these humanitarian providers, which include international\nnon-governmental organisations (grouped by federation), national NGOs that\ndeliver aid within their own borders, UN humanitarian agencies, and the\nInternational Red Cross and Red Crescent Movement.",
  "updated_at": "2023-10-02T00:20:36",
  "resources": [
    {
      "name": "entities.ftm.json",
      "url": "https://data.ftm.store/gdho/entities.ftm.json",
      "mime_type": "application/json+ftm",
      "mime_type_label": "FollowTheMoney Entities"
    }
  ],
  "children": [],
  "publisher": {
    "name": "Humanitarian Outcomes",
    "url": "https://www.humanitarianoutcomes.org/",
    "description": "Humanitarian Outcomes is a team of specialist consultants providing\nresearch and policy advice for humanitarian aid agencies and donor\ngovernments.",
    "official": false
  }
}
```
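
A consumer of this metadata mostly wants to know where the entities live. The sketch below parses an abbreviated copy of the metadata above and picks out the resources with the FollowTheMoney MIME type. `entity_urls` is a hypothetical helper, and the metadata is inlined so the example runs without network access.

```python
import json

# Abbreviated copy of the dataset metadata shown above
DATASET_JSON = """
{
  "name": "gdho",
  "title": "Global Database of Humanitarian Organisations",
  "updated_at": "2023-10-02T00:20:36",
  "resources": [
    {
      "name": "entities.ftm.json",
      "url": "https://data.ftm.store/gdho/entities.ftm.json",
      "mime_type": "application/json+ftm"
    }
  ]
}
"""

FTM_MIME_TYPE = "application/json+ftm"


def entity_urls(dataset: dict) -> list[str]:
    """Return the URLs of all resources that contain FtM entities."""
    return [
        resource["url"]
        for resource in dataset.get("resources", [])
        if resource.get("mime_type") == FTM_MIME_TYPE
    ]


if __name__ == "__main__":
    dataset = json.loads(DATASET_JSON)
    print(dataset["title"], entity_urls(dataset))
```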
### Many datasets = 1 Catalog
#### The catalogs for normal people :information_desk_person: :person_with_blond_hair:
https://www.opensanctions.org/datasets/sanctions/
https://catalog.investigativedata.io
#### for machines
investigraph.eu catalog: https://data.ftm.store/catalog.json
OpenSanctions.org catalog: https://data.opensanctions.org/datasets/latest/index.json
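
A machine-readable catalog can be treated as a lookup table of datasets. The sketch below assumes that the catalog is a JSON object with a `datasets` list whose items carry at least `name` and `title`; that layout, the `index_catalog` helper and the second sample dataset are assumptions for illustration, so check the real files before relying on them.

```python
# Hand-written sample; `demo` is a hypothetical dataset
CATALOG = {
    "datasets": [
        {"name": "gdho", "title": "Global Database of Humanitarian Organisations"},
        {"name": "demo", "title": "Hypothetical demo dataset"},
    ]
}


def index_catalog(catalog: dict) -> dict[str, dict]:
    """Index the datasets of a catalog by their name."""
    return {dataset["name"]: dataset for dataset in catalog.get("datasets", [])}


if __name__ == "__main__":
    datasets = index_catalog(CATALOG)
    for name in sorted(datasets):
        print(name, "->", datasets[name]["title"])
```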
## Remember: Documents are just Entities, too
### The next experiment: `leakrfc`
https://leak-rfc.org
`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The package and concepts are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).
`leakrfc` acts as a standardized storage and retrieval mechanism for sharing document collections and importing them into various analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).