# Finding leads in big data lakes: State of the art of the FollowTheMoney toolkit

This pad: **https://l.idio.is/ftm-scicar24**

Tutorials: https://pad.investigativedata.org/8Y-njO7qROeo0U7OOpplvQ#

This session: https://24.scicar.de/scicar24/talk/ZS3C3K/

## contact

Simon Wörpel, simon@investigativedata.org

investigativedata.io

## what's out there (aka "showcase")

- opensecuritydata.eu
- farmsubsidy.org
- spendengerichte.correctiv.org
- followthegrant.org
- opensanctions.org
- aleph.occrp.org
- aleph.investigativedata.org
- investigraph.eu

## Investigative Data Journalism / Computer Assisted Reporting

**Things of interest: Persons, Companies, Events, and how all of these are connected**

### Challenges

- Different names for the same persons across multiple datasets
- Sometimes we know exact dates for events, sometimes only the year
- For some _Things_ we have more granular information than for others
- For some _Things_ we have different information over time

### Research questions

- How similar is Person 1 to Person 2?
- Find the same candidates across multiple lists
- Which of these candidates show up in my _Documents_?

### What we need

- Standardized concepts to describe real-world _Things_ in a computer program
- Reproducible data generation
- Library-like dataset and catalog archiving and sharing

## FollowTheMoney

:mag_right: An investigative research method

:globe_with_meridians: A data standard (ontology) to describe investigative subjects such as Persons, Companies, and their relationships

:slot_machine: A set of tools that implement :mag_right: and :globe_with_meridians:

## 1. The method

Misuse of power, influence on politics, lobbyism, corruption, and most of the bad things in the world only happen because of :moneybag::moneybag::moneybag: (and :older_man:)

To investigate such things, "_just follow the money_"

To _just follow the money_, researchers need a lot of data about _Things_ and their connections.

## 2. The ontology

### Things

:smirk: Let's talk about me

- Schema: `Person`
- Properties:
  - name: `Simon Wörpel`
  - birth date: `1989-04`

:factory: my organization

- Schema: `Company`
- Label: `IDIO Daten Import Export GmbH`
- Properties:
  - Website: `investigativedata.io`

### Intervals

:smirk: -> :factory:

- Schema: `Directorship`
- Properties:
  - director: :smirk:
  - organization: :factory:
  - start date: `2023-04-26`

:factory: -> :moneybag: -> :smirk:

- Schema: `Payment`
- Properties:
  - payer: :factory:
  - beneficiary: :smirk:
  - amount: `2000`

## 3. The tools

### machine-readable data representation: JSON

```json
{
  "id": "simon",
  "schema": "Person",
  "properties": {
    "name": ["Simon Wörpel"],
    "birthDate": ["1989-04"],
    "firstName": ["Simon"],
    "lastName": ["Wörpel"]
  }
}
{
  "id": "de-361014839",
  "schema": "Company",
  "properties": {
    "name": ["IDIO Daten Import Export GmbH"],
    "incorporationDate": ["2023-04-26"],
    "website": ["https://investigativedata.io"],
    "legalForm": ["Gesellschaft mit beschränkter Haftung"]
  }
}
{
  "id": "de-361014839-directorship-simon",
  "schema": "Directorship",
  "properties": {
    "director": ["simon"],
    "organization": ["de-361014839"],
    "startDate": ["2023-04-26"]
  }
}
```

### data features

- All properties are _multi-valued_
- All properties are _strings_ (but we know about types)
- Dates can be _fuzzy_ (`2024`, `2024-04`, `2024-04-04` are all valid formats)

### the base library `followthemoney`

[Documentation](https://followthemoney.tech)

- Defines the model
- Transforms arbitrary input data into the ontology
- Base logic for comparison and _merging_

```bash
cat data.csv | ftm map-csv mapping.yml > entities.ftm.json
```

- The model (can be customized): https://followthemoney.tech/explorer/
- Mapping intro: https://docs.aleph.occrp.org/developers/how-to/data/import-tabular-data/#mapping-tabular-data

### wait, documents?

Any document is just another _Thing_, really!

```json
{
  "id": "22596363b3de40b06f981fb85d82312e8c0ed511",
  "schema": "PlainText",
  "properties": {
    "fileName": ["hello.txt"],
    "contentHash": ["22596363b3de40b06f981fb85d82312e8c0ed511"],
    "bodyText": ["hello world"]
  }
}
```

### Aleph

![](https://s3.investigativedata.org/hedgedoc-minyos/uploads/7cd641b3-8a00-49ba-abcf-636cb3e32277.png)

Aleph is open source software that can store huge datasets and make them searchable in a collaborative but secure way. It is mostly known for the [public instance](https://aleph.occrp.org) offered by the [Organized Crime and Corruption Reporting Project (OCCRP)](https://occrp.org), but any organization or research team can run its own exclusive and independent instance.

- https://aleph.occrp.org
- https://aleph.investigativedata.org
- A very sophisticated "FtM Viewer" (and ingestor)
- Scalable research platform and data archive
- Organize data in collections
- Upload and analyze documents
- Cross-matching between datasets

## Beyond Aleph

### `nomenklatura`

Nomenklatura de-duplicates and integrates different Follow the Money entities. It serves to clean up messy data and to find links between different datasets.

- Developed by [OpenSanctions](https://opensanctions.org)
- GitHub: https://github.com/opensanctions/nomenklatura

![](https://s3.investigativedata.org/hedgedoc-minyos/uploads/862fcd60-0db4-4506-b140-f8e578b89ff0.png)

- Next generation deduplication
- Introducing a new way to store information:

### Statements

`nomenklatura` (the base data layer for [OpenSanctions](https://opensanctions.org)) stores the _Entities_ in a statement-based model as a _list of observations_.

The OpenSanctions database is designed to meet the following design objectives:

- Be able to dynamically merge and un-merge entities from a broad range of data sources.
- Be able to identify the origin of each piece of information about a sanctions target or other entity.
- Track entities and their properties as they change over time.
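
A rough sketch of why a flat list of observations serves the objectives above. This is plain Python and not the `nomenklatura` API: the `Statement` fields, the `build_entities` helper, the `canonical` mapping, and the dataset names and the ID `eu-987` are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One observation: a dataset claims an entity's property has a value."""

    entity_id: str
    dataset: str
    schema: str
    prop: str
    value: str
    seen: str  # ISO date on which the claim was observed


def build_entities(statements, canonical=None):
    """Assemble entity dicts from statements (hypothetical helper).

    `canonical` maps source entity IDs to a merged ID. Dropping a mapping
    and rebuilding "un-merges" the entity, because no statement is changed.
    """
    canonical = canonical or {}
    entities = {}
    # Process in observation order so older values come first
    for stmt in sorted(statements, key=lambda s: s.seen):
        entity_id = canonical.get(stmt.entity_id, stmt.entity_id)
        entity = entities.setdefault(
            entity_id,
            {
                "id": entity_id,
                "schema": stmt.schema,
                "datasets": set(),  # origin of the information
                "properties": defaultdict(list),
            },
        )
        entity["datasets"].add(stmt.dataset)
        values = entity["properties"][stmt.prop]
        if stmt.value not in values:
            values.append(stmt.value)
    return entities


# Made-up observations from two made-up source datasets
statements = [
    Statement("ofac-12345", "us_ofac", "Person", "name", "John Doe", "2024-09-01"),
    Statement("eu-987", "eu_sanctions", "Person", "name", "Doe, John", "2024-09-15"),
    Statement("eu-987", "eu_sanctions", "Person", "birthDate", "1970", "2024-09-15"),
]

separate = build_entities(statements)
merged = build_entities(
    statements, canonical={"ofac-12345": "merged-1", "eu-987": "merged-1"}
)
print(len(separate), len(merged))
print(merged["merged-1"]["properties"]["name"])
```

Because merging only adds a mapping on top of untouched statements, every value keeps its dataset and observation date, which is what makes un-merging and provenance tracking cheap in this sketch.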

In order to meet these goals, the system uses a statement-based database design. To illustrate this, think of a claim like this one: the US sanctions list, as of the most recent update, claims that entity `ofac-12345` has the property `name` set to the value `John Doe`.

[Learn more](https://www.opensanctions.org/docs/statements/)

#### example :smirk:

| entity_id | dataset | schema | property | value | seen | source_url |
|-----------|---------|--------|----------|-------|------|------------|
| 1 | scicar24 | Person | firstName | Simon | 2024-09-28 | https://wrpl.de |
| 1 | scicar24 | Person | position | Data journalist | 2015-03-21 | https://correctiv.org/team/simon-woerpel/ |
| 1 | scicar24 | Person | position | Managing director | 2023-05-07 | https://investigativedata.io/contact/ |

Putin's names: https://www.opensanctions.org/statements/Q7747/?prop=name

### `ftmq`

Builds on top of `nomenklatura` and offers a simple storage and querying interface for `followthemoney` entities stored in the statement-based format.

https://github.com/investigativedata/ftmq

### `ftmq-api`

Exposes `followthemoney` data via a Python API: https://github.com/investigativedata/ftmstore-fastapi

### `ftm-joy-ui`

[React](https://react.dev) components based on [Joy UI](https://mui.com/joy-ui/getting-started/) for rendering _entities_.

### `investigraph`

Builds on top of the stack above and offers an easy-to-use interface to map, transform and load `followthemoney` data.

https://investigraph.dev

## Putting it all together: Building the lake

Both _Aleph_ and _nomenklatura_ specify a metadata model for a _Dataset_, which is a collection of `followthemoney` entities. This allows us to define metadata and resource links for a _Dataset_ and share it with other organizations.

Example dataset metadata: https://data.ftm.store/gdho/index.json

```json
{
  "name": "gdho",
  "title": "Global Database of Humanitarian Organisations",
  "summary": "GDHO is a global compendium of organisations that provide aid in humanitarian\ncrises. The database includes basic organisational and operational\ninformation on these humanitarian providers, which include international\nnon-governmental organisations (grouped by federation), national NGOs that\ndeliver aid within their own borders, UN humanitarian agencies, and the\nInternational Red Cross and Red Crescent Movement.",
  "updated_at": "2023-10-02T00:20:36",
  "resources": [
    {
      "name": "entities.ftm.json",
      "url": "https://data.ftm.store/gdho/entities.ftm.json",
      "mime_type": "application/json+ftm",
      "mime_type_label": "FollowTheMoney Entities"
    }
  ],
  "children": [],
  "publisher": {
    "name": "Humanitarian Outcomes",
    "url": "https://www.humanitarianoutcomes.org/",
    "description": "Humanitarian Outcomes is a team of specialist consultants providing\nresearch and policy advice for humanitarian aid agencies and donor\ngovernments.",
    "official": false
  }
}
```

### Many datasets = 1 Catalog

#### The catalogs for normal people :information_desk_person: :person_with_blond_hair:

https://www.opensanctions.org/datasets/sanctions/

https://catalog.investigativedata.io

#### for machines

investigraph.eu catalog: https://data.ftm.store/catalog.json

OpenSanctions.org catalog: https://data.opensanctions.org/datasets/latest/index.json

## Remember: Documents are just Entities, too

### The next experiment: `leakrfc`

https://leak-rfc.org

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data and for private and public document collections. The package and concepts are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a standardized storage and retrieval mechanism for sharing document collections and importing them into various analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).
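
Coming back to the machine-readable dataset metadata from the lake section: a minimal sketch of how a consumer might pick the entity resources out of metadata shaped like the `gdho` example and read the entities. The helper names `entity_resource_urls` and `read_entities` are made up for this pad, and the sketch assumes that `entities.ftm.json` holds one JSON object per line. It works on inline sample data instead of downloading anything.

```python
import json

# MIME type used for the entities resource in the `gdho` metadata example
FTM_MIME_TYPE = "application/json+ftm"


def entity_resource_urls(dataset):
    """Return the URLs of all resources that hold FtM entities (hypothetical helper)."""
    return [
        resource["url"]
        for resource in dataset.get("resources", [])
        if resource.get("mime_type") == FTM_MIME_TYPE
    ]


def read_entities(lines):
    """Yield entity dicts from text lines, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Trimmed copy of the `gdho` dataset metadata
dataset = {
    "name": "gdho",
    "resources": [
        {
            "name": "entities.ftm.json",
            "url": "https://data.ftm.store/gdho/entities.ftm.json",
            "mime_type": "application/json+ftm",
        }
    ],
}

urls = entity_resource_urls(dataset)

# Stand-in for downloaded file content, so the sketch runs offline
sample = [
    '{"id": "simon", "schema": "Person", "properties": {"name": ["Simon Wörpel"]}}',
    "",
    '{"id": "de-361014839", "schema": "Company", "properties": {"name": ["IDIO Daten Import Export GmbH"]}}',
]
schemata = [entity["schema"] for entity in read_entities(sample)]
print(urls, schemata)
```

In a real pipeline the `sample` list would be replaced by the lines of the file behind each URL, streamed one by one, so that even very large datasets never have to fit into memory at once.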