This script pulls data from politikus.sinarproject.org and caches it in a NetworkX graph to enable offline processing. The cache can then be saved into a Neo4j database for further processing and visualization.
The project depends on the following tools and Python packages in order to build and install properly.
- Python 3.9 and up
- While the development work targets Neo4j 4.1, earlier versions should work.
- Poetry - follow the installation instructions in the Poetry documentation.
- Python wheel - you can install it via pip
pip3 install wheel
- In order to generate graphs, Python needs to be compiled with Tk support (the tk-dev package on Ubuntu).
- Clone this project
git clone https://github.com/Sinar/popit_relationship
cd popit_relationship
- Install and build the project
poetry build
Install the built project with pip (filename of the .whl file may vary). Please ensure your PATH is configured properly.
pip3 install ./dist/popit_relationship-0.1.0-py3-none-any.whl
If you are reinstalling after pulling the latest changes, add a --force-reinstall flag
pip3 install --force-reinstall ./dist/popit_relationship-0.1.0-py3-none-any.whl
Most of the configuration is stored in a .env file; refer to .env.example for a sample. Apart from NEO4J_AUTH and NEO4J_URI, the script should work with the default settings.
- NEO4J_AUTH stores the username and password pair separated by a forward slash (/), e.g. neo4j/s0meCompl!catedPassword
- NEO4J_URI stores the URI to the Neo4j database, e.g. bolt://hostname:7687
- ENDPOINT_API stores the endpoint API URI, which defaults to https://politikus.sinarproject.org/@search; the script should work with other similar APIs
- CRAWL_INTERVAL stores the time to wait between API calls (defaults to 1 second)
- CACHE_PATH stores the path to the cache file (defaults to ./primport-cache.gpickle)
- GRAPHML_PATH stores the path used by primport export graphml (defaults to ./primport-cache.graphml)
The configuration environment variables can be overridden when executing the script (refer to the usage examples below).
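As an illustration of how these settings could be resolved, here is a minimal sketch, not the project's actual loader. The names load_config and split_auth are hypothetical, the precedence order (documented defaults, then .env values, then the process environment) is an assumption based on the description above, and splitting NEO4J_AUTH on the first slash only is likewise an assumption.

```python
import os

# Defaults as documented above; NEO4J_AUTH and NEO4J_URI must be supplied.
DEFAULTS = {
    "ENDPOINT_API": "https://politikus.sinarproject.org/@search",
    "CRAWL_INTERVAL": "1",
    "CACHE_PATH": "./primport-cache.gpickle",
    "GRAPHML_PATH": "./primport-cache.graphml",
}


def load_config(dotenv_values, environ=None):
    """Merge settings (assumed order): defaults < .env values < environment."""
    environ = os.environ if environ is None else environ
    config = {**DEFAULTS, **dotenv_values}
    for key in ("NEO4J_AUTH", "NEO4J_URI", *DEFAULTS):
        if key in environ:
            config[key] = environ[key]
    return config


def split_auth(neo4j_auth):
    """Split 'user/password' on the first slash, so the password may contain '/'."""
    user, sep, password = neo4j_auth.partition("/")
    if not sep:
        raise ValueError("NEO4J_AUTH must look like 'username/password'")
    return user, password


# Values a .env file might hold, plus one override from the environment.
config = load_config(
    {"NEO4J_AUTH": "neo4j/s0meCompl!catedPassword", "NEO4J_URI": "bolt://hostname:7687"},
    environ={"NEO4J_AUTH": "neo4j/someOtherPassword"},
)
print(split_auth(config["NEO4J_AUTH"]))  # ('neo4j', 'someOtherPassword')
print(float(config["CRAWL_INTERVAL"]))   # 1.0
```

Passing the environment as a plain dictionary keeps the sketch easy to test without touching the real process environment.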
After following the installation guide, if the Python environment is properly configured, a script named primport should be available. Sub-commands can then be issued for different tasks.
Configuration options can be overridden with environment variables, e.g. when running primport in Bash:
NEO4J_AUTH=neo4j/someOtherPassword primport reset db
- primport reset cache resets the cache file
- primport reset db clears the Neo4j database
- primport sync person fetches the Person API
- primport sync org fetches the Organization API
- primport sync post fetches the Post API
- primport sync membership fetches the Membership API
- primport sync relationship fetches the Relationship API
- primport sync ownership fetches the Ownership Control Statement API
- primport sync all fetches all of the above
- primport visualize $node1 [$node2 $node3 ...] generates a graph from the cache including $node1 ($node2, $node3 etc. are optional).
  - Each $node is a URI to an entity, for instance https://politikus.sinarproject.org/organizations/government-linked-companies/1mdb-real-estate-sdn-bhd
  - The maximum depth can be overridden by passing the --depth flag, e.g. --depth=1 (defaults to 3).
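To make the effect of the depth limit on primport visualize concrete, the following is a rough sketch of a depth-limited breadth-first traversal. It is an assumption about how depth might be interpreted (hops from any starting node), not the project's code. The function nodes_within_depth and the toy identifiers are hypothetical, and a plain adjacency dictionary stands in for the NetworkX cache so the example has no dependencies.

```python
from collections import deque


def nodes_within_depth(adjacency, start_nodes, depth=3):
    """Return each node reachable within `depth` hops, mapped to its hop count."""
    seen = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen


# Hypothetical toy graph; real node identifiers are entity URIs.
toy = {
    "org/a": ["person/b"],
    "person/b": ["org/a", "post/c"],
    "post/c": ["person/b", "org/d"],
    "org/d": ["post/c"],
}
print(nodes_within_depth(toy, ["org/a"], depth=1))
# {'org/a': 0, 'person/b': 1}
print(sorted(nodes_within_depth(toy, ["org/a"])))
# ['org/a', 'org/d', 'person/b', 'post/c']
```

With --depth=1 only direct neighbours of the starting nodes would be drawn under this interpretation, while the default of 3 reaches entities up to three relationships away.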
primport savesaves the cached data to the Neo4j database to allow further work.
- primport export graphml writes the cached graph to GRAPHML_PATH
- primport export graphml ./primport-cache.graphml writes the cached graph to the specified GraphML file using NetworkX's built-in GraphML exporter
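For readers who want to see what a GraphML export looks like at the NetworkX level, here is a small sketch using nx.write_graphml and nx.read_graphml. The graph contents, the example.org URIs, the attribute names, and the choice of a directed graph are all made up for illustration; the structure of the real cache may differ.

```python
import os
import tempfile

import networkx as nx

# Hypothetical stand-in for the cached graph; nodes are keyed by entity URI.
person = "https://example.org/persons/a"
org = "https://example.org/organizations/b"

graph = nx.DiGraph()
graph.add_node(person, name="Person A")
graph.add_node(org, name="Org B")
graph.add_edge(person, org, role="member")

# Write to a temporary location rather than the real GRAPHML_PATH.
path = os.path.join(tempfile.mkdtemp(), "primport-cache.graphml")
nx.write_graphml(graph, path)

# Read the file back to confirm the round trip.
restored = nx.read_graphml(path)
print(restored.number_of_nodes(), restored.number_of_edges())  # 2 1
```

GraphML attributes are limited to simple scalar types such as strings, numbers and booleans, so a cache holding richer values would likely need them converted before export.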
- The script can be executed normally as follows (just replace primport with poetry run python src/popit_relationship/primport.py)
git clone https://github.com/Sinar/popit_relationship
cd popit_relationship
poetry install
poetry run python src/popit_relationship/primport.py reset db
Testing is done with pytest:
poetry run pytest