A new version of the CIDACS-RL linkage tool suitable for cluster computing environments.

Overview

Fully Distributed CIDACS-RL

CIDACS-RL is a Brazilian record linkage tool suited to integrating large amounts of data with high accuracy. However, its current implementation relies on an Elasticsearch cluster to distribute the queries and on a single node to perform them through the Python multiprocessing library. This implementation of CIDACS-RL can be deployed on a Spark cluster, using all the resources available to the Jupyter kernel while still using the Elasticsearch cluster, which makes it a fully distributed, cluster-based solution. It can outperform the legacy version of CIDACS-RL in both multi-node and single-node Spark environments.

config.json

Almost every aspect of the linkage can be controlled through the config.json file.

| Section | Sub-section | Field (datatype) | Field description |
|---|---|---|---|
| General info | | index_data (str<'yes', 'no'>) | Whether the linkage process includes indexing a dataset into Elasticsearch. Constraints: string, either "yes" or "no". |
| General info | | es_index_name (str<ES_VALID_INDEX>) | Name of an existing Elasticsearch index (if index_data is 'no') or of a new one (if index_data is 'yes'). Constraints: string, valid Elasticsearch index name. |
| General info | | es_connect_string (str<ES_URL:ES_PORT>) | Elasticsearch API address. Constraints: string, URL format. |
| General info | | query_size (int) | Number of candidates returned by each Elasticsearch query. Constraints: int. |
| General info | | cutoff_exact_match (str<0:1 number>) | Cutoff point that determines whether a pair is an exact match. Constraints: string holding a number between 0 and 1. |
| General info | | null_value (str) | Value used to replace missing values in both datasets. Constraints: string. |
| General info | | temp_dir (str) | Directory used to write checkpoints for the exact match and non-exact match phases. Constraints: string, fully qualified path. |
| General info | | debug (str<'true', 'false'>) | If set to "true", all records found in the exact match phase are queried again in the non-exact match phase. |
| Datasets info | Indexed dataset | path (str) | Path to the csv or parquet folder of the dataset to index. |
| Datasets info | Indexed dataset | extension (str<'csv', 'parquet'>) | Determines how Spark reads the data. |
| Datasets info | Indexed dataset | columns (list) | Python list with the names of the columns involved in the linkage. |
| Datasets info | Indexed dataset | id_column_name (str) | Name of the id column. |
| Datasets info | Indexed dataset | storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>) | Directive for memory allocation on Spark. |
| Datasets info | Indexed dataset | default_paralelism (str<4*N_OF_AVAILABLE_CORES>) | Number of partitions of the Spark dataframe. |
| Datasets info | tolink dataset | path (str) | Path to the csv or parquet folder of the dataset to link. |
| Datasets info | tolink dataset | extension (str<'csv', 'parquet'>) | Determines how Spark reads the data. |
| Datasets info | tolink dataset | columns (list) | Python list with the names of the columns involved in the linkage. |
| Datasets info | tolink dataset | id_column_name (str) | Name of the id column. |
| Datasets info | tolink dataset | storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>) | Directive for memory allocation on Spark. |
| Datasets info | tolink dataset | default_paralelism (str<4*N_OF_AVAILABLE_CORES>) | Number of partitions of the Spark dataframe. |
| Datasets info | result dataset | path (str) | Path to the folder where the result dataset is written. |
| Comparisons | label1 | indexed_col (str) | Name of the first column to be compared in the indexed dataset. |
| Comparisons | label1 | tolink_col (str) | Name of the first column to be compared in the tolink dataset. |
| Comparisons | label1 | must_match (str<'true', 'false'>) | Whether this pair of columns is included in the exact match phase. |
| Comparisons | label1 | should_match (str<'true', 'false'>) | Whether this pair of columns is included in the non-exact match phase. |
| Comparisons | label1 | is_fuzzy (str<'true', 'false'>) | Whether this pair of columns is included in fuzzy queries during the non-exact match phase. |
| Comparisons | label1 | boost (str) | Boost (weight) of this pair of columns in the queries. |
| Comparisons | label1 | query_type (str<'match', 'term'>) | Type of matching for this pair of columns in the non-exact match phase. |
| Comparisons | label1 | similarity (str<'jaro_winkler', 'overlap', 'hamming'>) | Similarity to be calculated between the values of this pair of columns. |
| Comparisons | label1 | weight (str) | Weight of this pair of columns. |
| Comparisons | label1 | penalty (str) | Penalty applied to the overall similarity in case of missing value(s). |
| Comparisons | label2 | ... | ... |
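
How the similarity, weight, penalty and null_value settings might combine into a single score for a candidate pair can be sketched as follows. This is only an illustration: the function names, the weighted-average formula, the way the penalty is subtracted, and the simplified definitions of `overlap` and `hamming` are all assumptions, not the tool's confirmed implementation, and `jaro_winkler` is left out for brevity.

```python
def hamming_similarity(a: str, b: str) -> float:
    """Share of positions holding the same character (assumed definition)."""
    if not a or not b or len(a) != len(b):
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


def overlap_similarity(a: str, b: str) -> float:
    """1.0 when both values are identical, 0.0 otherwise (assumed definition)."""
    return 1.0 if a == b else 0.0


# Hypothetical registry mapping the `similarity` setting to a function.
SIMILARITIES = {"hamming": hamming_similarity, "overlap": overlap_similarity}


def overall_similarity(indexed_row, tolink_row, comparisons, null_value="99"):
    """Weighted mean of the column similarities, minus a penalty per missing pair.

    The combination rule is an assumption made for this sketch.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    total_penalty = 0.0
    for rule in comparisons.values():
        a = indexed_row[rule["indexed_col"]]
        b = tolink_row[rule["tolink_col"]]
        if a == null_value or b == null_value:
            # Missing value: skip the comparison and accumulate its penalty.
            total_penalty += float(rule["penalty"])
            continue
        weight = float(rule["weight"])
        weighted_sum += weight * SIMILARITIES[rule["similarity"]](a, b)
        total_weight += weight
    if total_weight == 0.0:
        return 0.0
    return max(0.0, weighted_sum / total_weight - total_penalty)


comparisons = {
    "birthdate": {"indexed_col": "dt_nasc_a", "tolink_col": "dt_nasc_b",
                  "similarity": "hamming", "weight": 1.0, "penalty": 0.02},
    "sex": {"indexed_col": "sexo_a", "tolink_col": "sexo_b",
            "similarity": "overlap", "weight": 3.0, "penalty": 0.02},
}
row_a = {"dt_nasc_a": "19900115", "sexo_a": "1"}   # made-up sample values
row_b = {"dt_nasc_b": "19900116", "sexo_b": "1"}
score = overall_similarity(row_a, row_b, comparisons)
is_exact = score >= float("0.95")  # compared against cutoff_exact_match
```

Under these assumptions, a score computed this way would be compared against `cutoff_exact_match` to decide whether the pair counts as an exact match.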

config.json example


{
    "index_data": "no",
    "es_index_name": "fd-cidacs-rl",
    "es_connect_string": "http://localhost:9200",
    "query_size": 100,
    "cutoff_exact_match": "0.95",
    "null_value": "99",
    "temp_dir": "../../../0_global_data/fd-cidacs-rl/temp_dataframe/",
    "debug": "false",

    "datasets_info": {
        "indexed_dataset": {
            "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-dataset-A.parquet",
            "extension": "parquet",
            "columns": ["id_cidacs_a", "nome_a", "nome_mae_a", "dt_nasc_a", "sexo_a"],
            "id_column_name": "id_cidacs_a",
            "storage_level": "MEMORY_ONLY",
            "default_paralelism": "16"},
        "tolink_dataset": {
            "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-datasets-b/sinthetic-datasets-b-500000.parquet",
            "extension": "parquet",
            "columns": ["id_cidacs_b", "nome_b", "nome_mae_b", "dt_nasc_b", "sexo_b"],
            "id_column_name": "id_cidacs_b",
            "storage_level": "MEMORY_ONLY",
            "default_paralelism": "16"},
        "result_dataset": {
            "path": "../0_global_data/result/500000/"}},

    "comparisons": {
        "name": {
            "indexed_col": "nome_a",
            "tolink_col": "nome_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "true",
            "boost": "3.0",
            "query_type": "match",
            "similarity": "jaro_winkler",
            "weight": 5.0,
            "penalty": 0.02},
        "mothers_name": {
            "indexed_col": "nome_mae_a",
            "tolink_col": "nome_mae_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "true",
            "boost": "2.0",
            "query_type": "match",
            "similarity": "jaro_winkler",
            "weight": 5.0,
            "penalty": 0.02},
        "birthdate": {
            "indexed_col": "dt_nasc_a",
            "tolink_col": "dt_nasc_b",
            "must_match": "false",
            "should_match": "true",
            "is_fuzzy": "false",
            "boost": "",
            "query_type": "term",
            "similarity": "hamming",
            "weight": 1.0,
            "penalty": 0.02},
        "sex": {
            "indexed_col": "sexo_a",
            "tolink_col": "sexo_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "false",
            "boost": "",
            "query_type": "term",
            "similarity": "overlap",
            "weight": 3.0,
            "penalty": 0.02}}}
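
To illustrate how the comparison flags could translate into an Elasticsearch request for the non-exact match phase, the sketch below builds a `bool` query body for one record of the tolink dataset. The function name `build_non_exact_query`, the sample record, the use of `"fuzziness": "AUTO"`, and the mapping of `should_match` onto `should` clauses are assumptions made for illustration; the tool's actual query construction may differ. The configuration is a trimmed subset of the example above.

```python
import json


def build_non_exact_query(record, comparisons, query_size=100):
    """Build an Elasticsearch bool/should query body for one record to link."""
    should = []
    for rule in comparisons.values():
        if rule["should_match"] != "true":
            continue  # column pair not used in the non-exact phase
        field = rule["indexed_col"]
        value = record[rule["tolink_col"]]
        if rule["query_type"] == "term":
            kind, clause = "term", {"value": value}
        else:
            kind, clause = "match", {"query": value}
            if rule["is_fuzzy"] == "true":
                clause["fuzziness"] = "AUTO"  # assumed fuzziness setting
        if rule["boost"]:
            # An empty boost string means "use the Elasticsearch default".
            clause["boost"] = float(rule["boost"])
        should.append({kind: {field: clause}})
    return {"size": query_size, "query": {"bool": {"should": should}}}


config = json.loads("""
{
  "query_size": 100,
  "comparisons": {
    "name": {"indexed_col": "nome_a", "tolink_col": "nome_b",
             "should_match": "true", "is_fuzzy": "true",
             "boost": "3.0", "query_type": "match"},
    "birthdate": {"indexed_col": "dt_nasc_a", "tolink_col": "dt_nasc_b",
                  "should_match": "true", "is_fuzzy": "false",
                  "boost": "", "query_type": "term"}
  }
}
""")

record = {"nome_b": "MARIA DA SILVA", "dt_nasc_b": "19900115"}  # made-up record
body = build_non_exact_query(record, config["comparisons"], config["query_size"])
print(json.dumps(body, indent=2))
```

Keeping the query as a plain dictionary makes it easy to inspect before it is sent to the index named in `es_index_name`.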

Running in a Standalone Spark Cluster

Read more:

  • https://github.com/elastic/elasticsearch-hadoop
  • https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
  • https://search.maven.org/artifact/org.elasticsearch/elasticsearch-spark-30_2.12

If you intend to run this tool in a single-node Spark environment, consider including this in your spark-submit or spark-shell command line:


pyspark --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes="localhost" --conf spark.es.port="9200"

If you are running on a Spark cluster through JupyterHub kernels, try adding this kernel or editing an existing one:


{
  "display_name": "Spark3.3",
  "language": "python",
  "argv": [
    "/opt/bigdata/anaconda3/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/bigdata/spark",
    "PYTHONPATH": "/opt/bigdata/spark/python:/opt/bigdata/spark/python/lib/py4j-0.10.9.2-src.zip",
    "PYTHONSTARTUP": "/opt/bigdata/spark/python/pyspark/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/bigdata/anaconda3/bin/python",
    "PYSPARK_SUBMIT_ARGS": "--master spark://node1.sparkcluster:7077 --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes=['node1','node2'] --conf spark.es.port='9200' pyspark-shell"
  }
}

Some advice for indexed data and queries

  • Every column should be cast to string (df.withColumn('column', F.col('column').cast('string'))).
  • Date columns will not be indexed properly as strings unless a preprocessing step transforms them from yyyy-MM-dd to yyyyMMdd.
  • All nodes of the Elasticsearch cluster must be listed in the Spark configuration passed along with --packages (the spark.es.nodes setting).
  • Term queries work well for well-structured variables, such as CPF, dates, CNPJ, etc.
Owner
Robespierre Pita
AI Researcher