Building the ABT - Applying PySpark and SQL
Contents
1. Importing libraries
2. Creating and starting a PySpark session
3. Creating the datasets by reading the *.csv and *.json files
3.1. CSV files
3.2. JSON files
4. Data analysis
5. Building the ABT
5.1. Analysis of the ABT data
5.2. Analysis of the ABT target
5.3. Generating a representative sample of the ABT
6. Saving the ABT in Parquet format
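1. Importing libraries
In [90]: from pyspark.sql import SparkSession
import os
import json
import csv
2. Creating and starting a PySpark session
In [91]: # Session settings reconstructed from the Out[91] summary below (master and app name)
spark = (
    SparkSession.builder
    .master('local[*]')
    .appName('ABT - Transações de clientes')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('ERROR')  # show only errors in the logs
spark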
Out[91]: SparkSession - in-memory
SparkContext
Spark UI
Version v3.5.4
Master local[*]
AppName ABT - Transações de clientes
cards_data.csv
root
|-- id: integer (nullable = true)
|-- client_id: integer (nullable = true)
|-- card_brand: string (nullable = true)
|-- card_type: string (nullable = true)
|-- card_number: long (nullable = true)
|-- expires: string (nullable = true)
|-- cvv: integer (nullable = true)
|-- has_chip: string (nullable = true)
|-- num_cards_issued: integer (nullable = true)
|-- credit_limit: string (nullable = true)
|-- acct_open_date: string (nullable = true)
|-- year_pin_last_changed: integer (nullable = true)
|-- card_on_dark_web: string (nullable = true)
transactions_data.csv
users_data.csv
root
|-- id: integer (nullable = true)
|-- current_age: integer (nullable = true)
|-- retirement_age: integer (nullable = true)
|-- birth_year: integer (nullable = true)
|-- birth_month: integer (nullable = true)
|-- gender: string (nullable = true)
|-- address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- per_capita_income: string (nullable = true)
|-- yearly_income: string (nullable = true)
|-- total_debt: string (nullable = true)
|-- credit_score: integer (nullable = true)
|-- num_credit_cards: integer (nullable = true)
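The schemas above infer the monetary columns (`credit_limit`, `per_capita_income`, `yearly_income`, `total_debt`) as `string`. The rows printed later in the notebook show why: the values carry a `$` prefix, for example `$29278`. The snippet below is a minimal sketch of the cleanup that numeric analysis on these columns would need. The helper name `money_to_float` is hypothetical, and the notebook's own conversion, if any, is not part of this export.

```python
def money_to_float(value):
    """Convert a string such as '$29278' to a float; None stays None."""
    if value is None:
        return None
    # Drop the currency symbol and any thousands separators before parsing
    cleaned = value.strip().replace('$', '').replace(',', '')
    return float(cleaned) if cleaned else None


print(money_to_float('$29278'))  # 29278.0
print(money_to_float('$0'))      # 0.0
```

On a Spark column the same idea can be expressed with `regexp_replace` from `pyspark.sql.functions` followed by `.cast('double')`.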
    if not data:
        print('JSON file is empty.')
        return
    # Write the CSV
    with open(csv_path, 'w', encoding='utf-8', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(list_columns)  # write the header row
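Only a fragment of the JSON-to-CSV helper survives in this export. The following is a minimal sketch of what a complete version might look like, assuming the JSON file holds a flat object that maps each key to a single value (which matches the two-column schemas of `mcc_codes.json` and `train_fraud_labels.json` shown below). The function name `json_to_csv` and its parameter `json_path` are hypothetical; `csv_path` and `list_columns` come from the fragment.

```python
import csv
import json


def json_to_csv(json_path, csv_path, list_columns):
    """Write a flat JSON object ({key: value, ...}) as a two-column CSV file."""
    with open(json_path, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)

    if not data:
        print('JSON file is empty.')
        return

    with open(csv_path, 'w', encoding='utf-8', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(list_columns)      # header row
        for key, value in data.items():    # one CSV row per JSON entry
            writer.writerow([key, value])
```

A call could look like `json_to_csv('mcc_codes.json', 'mcc_codes.csv', ['code', 'description'])`, where the output path is only an example. If the real file nests its entries under a top-level key, that level would have to be unwrapped first.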
mcc_codes.json
root
|-- code: integer (nullable = true)
|-- description: string (nullable = true)
train_fraud_labels.json
root
|-- transaction_id: integer (nullable = true)
|-- is_fraud: string (nullable = true)
Size of df_cards
Rows: 6146
Columns: 13
Size of df_transactions
Rows: 13305915
Columns: 12
Size of df_clients
Rows: 2000
Columns: 14
Size of df_mcc
Rows: 109
Columns: 2
Size of df_train_fraud
Rows: 8914963
Columns: 2
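The cell that printed these row and column counts is not part of the export. A minimal sketch of a helper that would produce output of this shape is shown here; the name `print_size` is hypothetical. It relies only on `count()` and `columns`, both of which exist on a PySpark `DataFrame`.

```python
def print_size(df, name):
    """Print and return the row and column counts of a DataFrame-like object."""
    n_rows = df.count()        # on a real Spark DataFrame this triggers a job
    n_cols = len(df.columns)   # column names are metadata, so this is cheap
    print(f'Size of {name}')
    print(f'Rows: {n_rows}')
    print(f'Columns: {n_cols}')
    return n_rows, n_cols
```

In the notebook this would be called as `print_size(df_cards, 'df_cards')`, and likewise for the other four DataFrames. Because `count()` scans the data, calling it on `df_transactions` is the expensive step.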
In [103… df_cards.createOrReplaceTempView('tb_cards')
df_clients.createOrReplaceTempView('tb_clients')
df_mcc.createOrReplaceTempView('tb_mcc')
df_train_fraud.createOrReplaceTempView('tb_train_fraud')
df_transactions.createOrReplaceTempView('tb_transactions')
+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
|id  |current_age|retirement_age|birth_year|birth_month|gender|address                 |latitude|longitude|per_capita_income|yearly_income|total_debt|credit_score|num_credit_cards|
+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
|825 |53         |66            |1966      |11         |Female|462 Rose Lane           |34.15   |-117.76  |$29278           |$59696       |$127613   |787         |5               |
|1746|53         |68            |1966      |12         |Female|3606 Federal Boulevard  |40.76   |-73.74   |$37891           |$77254       |$191349   |701         |5               |
|1718|81         |67            |1938      |11         |Female|766 Third Drive         |34.02   |-117.89  |$22681           |$33483       |$196      |698         |5               |
|708 |63         |63            |1957      |1          |Female|3 Madison Street        |40.71   |-73.99   |$163145          |$249925      |$202328   |722         |4               |
|1164|43         |70            |1976      |9          |Male  |9620 Valley Stream Drive|37.76   |-122.44  |$53797           |$109687      |$183855   |675         |1               |
|68  |42         |70            |1977      |10         |Male  |58 Birch Lane           |41.55   |-90.6    |$20599           |$41997       |$0        |704         |3               |
|1075|36         |67            |1983      |12         |Female|5695 Fifth Street       |38.22   |-85.74   |$25258           |$51500       |$102286   |672         |3               |
|1711|26         |67            |1993      |12         |Male  |1941 Ninth Street       |45.51   |-122.64  |$26790           |$54623       |$114711   |728         |1               |
|1116|81         |66            |1938      |7          |Female|11 Spruce Avenue        |40.32   |-75.32   |$26273           |$42509       |$2895     |755         |5               |
|1752|34         |60            |1986      |1          |Female|887 Grant Street        |29.97   |-92.12   |$18730           |$38190       |$81262    |810         |1               |
+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
only showing top 10 rows
Viewing the data in tb_mcc
+----+------------------------------------------+
|code|description |
+----+------------------------------------------+
|5812|Eating Places and Restaurants |
|5541|Service Stations |
|7996|Amusement Parks, Carnivals, Circuses |
|5411|Grocery Stores, Supermarkets |
|4784|Tolls and Bridge Fees |
|4900|Utilities - Electric, Gas, Water, Sanitary|
|5942|Book Stores |
|5814|Fast Food Restaurants |
|4829|Money Transfer |
|5311|Department Stores |
+----+------------------------------------------+
only showing top 10 rows
+--------------+--------+
|transaction_id|is_fraud|
+--------------+--------+
|10649266 |No |
|23410063 |No |
|9316588 |No |
|12478022 |No |
|9558530 |No |
|12532830 |No |
|19526714 |No |
|9906964 |No |
|13224888 |No |
|13749094 |No |
+--------------+--------+
only showing top 10 rows
5. Building the ABT
In [112… abt_01.createOrReplaceTempView('tb_abt_01')
Size of tb_abt_01
Rows: 13305915
Columns: 43
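The query that builds `abt_01` is not included in the export, but its result constrains it quite tightly: the row count equals that of `df_transactions` (13305915), the 43 columns are the sum of the five source tables (12 + 13 + 14 + 2 + 2), and the column list printed further down shows the renamed keys `id_card`, `client_id_card` and `id_client`. The sketch below is a hypothetical reconstruction under those observations. The join keys are inferred from the column names and the use of `LEFT JOIN` from the unchanged row count; neither is confirmed by the source.

```python
# Hypothetical reconstruction of the ABT query (join keys are assumptions).
ABT_QUERY = """
SELECT
    t.*,
    c.id        AS id_card,
    c.client_id AS client_id_card,
    c.card_brand, c.card_type, c.card_number, c.expires, c.cvv, c.has_chip,
    c.num_cards_issued, c.credit_limit, c.acct_open_date,
    c.year_pin_last_changed, c.card_on_dark_web,
    u.id        AS id_client,
    u.current_age, u.retirement_age, u.birth_year, u.birth_month, u.gender,
    u.address, u.latitude, u.longitude, u.per_capita_income, u.yearly_income,
    u.total_debt, u.credit_score, u.num_credit_cards,
    m.code, m.description,
    f.transaction_id, f.is_fraud
FROM tb_transactions AS t
LEFT JOIN tb_cards       AS c ON t.card_id   = c.id
LEFT JOIN tb_clients     AS u ON t.client_id = u.id
LEFT JOIN tb_mcc         AS m ON t.mcc       = m.code
LEFT JOIN tb_train_fraud AS f ON t.id        = f.transaction_id
"""

# In the notebook this would be run as: abt_01 = spark.sql(ABT_QUERY)
```

With left joins, every transaction is kept even when it has no entry in `tb_train_fraud`, which is why `is_fraud` can be `NULL` in the ABT.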
+-----------+-------------------+---------------+------------+-----------+
|CONTEM_CHIP|METODO_AUTENTICACAO|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+-----------+-------------------+---------------+------------+-----------+
|YES |Swipe Transaction |5774805 |801 |0.014 |
|YES |Chip Transaction |4780818 |3176 |0.066 |
|YES |Online Transaction |1419172 |8155 |0.575 |
|NO |Swipe Transaction |1192380 |576 |0.048 |
|NO |Online Transaction |138740 |624 |0.45 |
+-----------+-------------------+---------------+------------+-----------+
+-----------+---------------+------------+-----------+
|value_range|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+-----------+---------------+------------+-----------+
|0-50 |8198871 |4943 |0.06 |
|51-100 |2939552 |2761 |0.094 |
|101-500 |1366430 |4690 |0.343 |
|1000+ |767187 |687 |0.09 |
|501-1000 |33875 |251 |0.741 |
+-----------+---------------+------------+-----------+
+--------------+---------------+------------+-----------+
|UF_COMERCIANTE|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+--------------+---------------+------------+-----------+
|ONLINE |1563700 |8779 |0.561 |
|CA |1427087 |127 |0.009 |
|TX |1010207 |76 |0.008 |
|NY |857510 |58 |0.007 |
|FL |701623 |63 |0.009 |
|OH |484122 |316 |0.065 |
|IL |467931 |36 |0.008 |
|NC |429427 |44 |0.01 |
|PA |417766 |33 |0.008 |
|MI |397970 |39 |0.01 |
|GA |368206 |22 |0.006 |
|NJ |322227 |42 |0.013 |
|IN |312470 |26 |0.008 |
|WA |286525 |23 |0.008 |
|TN |284709 |14 |0.005 |
|VA |230685 |26 |0.011 |
|AZ |195940 |11 |0.006 |
|MO |195854 |32 |0.016 |
|MD |193776 |18 |0.009 |
|MN |178808 |14 |0.008 |
+--------------+---------------+------------+-----------+
only showing top 20 rows
+------+---------------+------------+-----------+
|GENERO|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+------+---------------+------------+-----------+
|Female|6815916 |6982 |0.102 |
|Male |6489999 |6350 |0.098 |
+------+---------------+------------+-----------+
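The four tables above share one shape: a grouping column, the number of transactions, the number of frauds, and `TAXA_FRAUDE`. The numbers are consistent with `TAXA_FRAUDE` being the fraud count as a percentage of all transactions in the group, rounded to three decimals, with unlabeled transactions included in the denominator. The queries themselves are not in the export, so the builder below is only a sketch under that reading; the function name `fraud_rate_query` is hypothetical.

```python
def fraud_rate_query(group_expr, alias, table='tb_abt_01'):
    """Build a SQL query with transaction count, fraud count and fraud rate (%) per group."""
    return f"""
        SELECT
            {group_expr} AS {alias},
            COUNT(*) AS QTDE_TRANSACOES,
            SUM(CASE WHEN is_fraud = 'Yes' THEN 1 ELSE 0 END) AS QTDE_FRAUDES,
            ROUND(100.0 * SUM(CASE WHEN is_fraud = 'Yes' THEN 1 ELSE 0 END) / COUNT(*), 3) AS TAXA_FRAUDE
        FROM {table}
        GROUP BY {group_expr}
        ORDER BY QTDE_TRANSACOES DESC
    """


# Check of the rate definition against the first row of the chip table:
# 801 frauds in 5774805 transactions
print(round(100 * 801 / 5774805, 3))  # 0.014
```

In the notebook, the gender table could then be produced with `spark.sql(fraud_rate_query('gender', 'GENERO')).show(truncate=False)`. Since the rates are percentages, the 0.575 for online transactions on chip cards means about 0.6% of those transactions, not 57.5%.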
+--------+---------------+----------+
|is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+--------+---------------+----------+
|No |8901631 |66.9 |
|NULL |4390952 |33.0 |
|Yes |13332 |0.1 |
+--------+---------------+----------+
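The target distribution can be cross-checked against the dataset sizes reported earlier. The short calculation below uses only the counts from the table above and shows that the labeled rows (`No` plus `Yes`) add up to exactly the 8914963 rows of `df_train_fraud`, so the 33% of `NULL` values are transactions without a label rather than a join problem.

```python
# Counts copied from the table above (None stands for the NULL label)
counts = {'No': 8901631, None: 4390952, 'Yes': 13332}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
labeled = counts['No'] + counts['Yes']

print(total)    # 13305915, the row count of tb_abt_01
print(labeled)  # 8914963, the row count of df_train_fraud
print(shares)   # {'No': 66.9, None: 33.0, 'Yes': 0.1}

# Fraud share among labeled transactions only
print(round(100 * counts['Yes'] / labeled, 2))  # 0.15
```

Whichever denominator is used, fraud is a rare class, which matters for any model trained on this ABT.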
+----+--------+---------------+----------+
|ANO |is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+----+--------+---------------+----------+
|2010|Yes |2573 |0.02 |
|2010|No |828956 |6.23 |
|2010|NULL |409351 |3.08 |
|2011|Yes |37 |0.0 |
|2011|No |863391 |6.49 |
|2011|NULL |427342 |3.21 |
|2012|Yes |923 |0.01 |
|2012|No |884498 |6.65 |
|2012|NULL |436251 |3.28 |
|2013|Yes |1337 |0.01 |
|2013|No |905967 |6.81 |
|2013|NULL |445504 |3.35 |
|2014|Yes |664 |0.0 |
|2014|No |914409 |6.87 |
|2014|NULL |450464 |3.39 |
|2015|Yes |2189 |0.02 |
|2015|No |928035 |6.97 |
|2015|NULL |457841 |3.44 |
|2016|Yes |2448 |0.02 |
|2016|No |930314 |6.99 |
|2016|NULL |459355 |3.45 |
|2017|Yes |172 |0.0 |
|2017|No |937112 |7.04 |
|2017|NULL |462024 |3.47 |
|2018|Yes |1629 |0.01 |
|2018|No |932970 |7.01 |
|2018|NULL |460193 |3.46 |
|2019|Yes |1360 |0.01 |
|2019|No |775979 |5.83 |
|2019|NULL |382627 |2.88 |
+----+--------+---------------+----------+
Out[124… DataFrame[id: int, date: timestamp, client_id: int, card_id: int, amount: string,
use_chip: string, merchant_id: int, merchant_city: string, merchant_state: string, zip: double,
mcc: int, errors: string, id_card: int, client_id_card: int, card_brand: string, card_type: string,
card_number: bigint, expires: string, cvv: int, has_chip: string, num_cards_issued: int,
credit_limit: string, acct_open_date: string, year_pin_last_changed: int, card_on_dark_web: string,
id_client: int, current_age: int, retirement_age: int, birth_year: int, birth_month: int,
gender: string, address: string, latitude: double, longitude: double, per_capita_income: string,
yearly_income: string, total_debt: string, credit_score: int, num_credit_cards: int, code: int,
description: string, transaction_id: int, is_fraud: string]
+--------+---------------+----------+
|is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+--------+---------------+----------+
|No |4450946 |66.913 |
|NULL |2194300 |32.988 |
|Yes |6640 |0.1 |
+--------+---------------+----------+
Size of tb_abt_02
Rows: 6651886
Columns: 43
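Whether the sample is representative of the full ABT can be checked by comparing the target distribution of the two tables. The calculation below uses only the counts printed in this notebook; the 0.02 percentage-point tolerance is an arbitrary choice for illustration.

```python
# is_fraud counts from the two distribution tables (None stands for NULL)
full = {'No': 8901631, None: 4390952, 'Yes': 13332}    # tb_abt_01
sample = {'No': 4450946, None: 2194300, 'Yes': 6640}   # tb_abt_02


def shares(counts):
    """Return each label's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


fraction = sum(sample.values()) / sum(full.values())
max_gap = max(abs(shares(full)[k] - shares(sample)[k]) for k in full)

print(round(fraction, 3))  # 0.5
print(max_gap < 0.02)      # True
```

The sample holds about half of the rows, and every class share stays within 0.02 percentage points of the full table. That is what a simple random sample would be expected to give, for example `DataFrame.sample` with `fraction=0.5`, although the sampling cell itself is not part of this export.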