
Building the ABT - Applying PySpark and SQL

This document describes how to build an Analytical Base Table (ABT) with PySpark: importing the libraries, creating a session, and reading CSV and JSON files. It covers the data analysis, the construction of the ABT, and the generation of representative samples, and ends by saving the ABT in Parquet format. It also reports the volumetry of the dataframes and creates temporary views for analysis.

Uploaded by magno silva

Building the ABT

ABT - Analytical Base Table

Contents
1. Importing the libraries
2. Creating and starting a PySpark session
3. Creating the datasets by reading the *.csv and *.json files
3.1. CSV files
3.2. JSON files
4. Data analysis
5. Building the ABT
5.1. ABT data analysis
5.2. Analysis of the ABT target
5.3. Generating a representative sample of the ABT
6. Saving the ABT in Parquet format

1. Importing the libraries
In [90]: from pyspark.sql import SparkSession
import os
import json
import csv

2. Creating and starting a PySpark session

In [91]: appName = 'ABT - Transações de clientes'

# SparkSession object built with specific settings
spark = SparkSession.builder \
    .appName(appName) \
    .config('spark.driver.memory', '8g') \
    .config('spark.executor.memory', '8g') \
    .config('spark.master', 'local[*]') \
    .getOrCreate()

spark.sparkContext.setLogLevel('ERROR')

spark
Out[91]: SparkSession - in-memory

SparkContext

Spark UI

Version v3.5.4
Master local[*]
AppName ABT - Transações de clientes

3. Creating the datasets by reading the *.csv and *.json files

In [92]: # Path to the *.csv and *.json files
caminho = 'dados/'

3.1. CSV files

cards_data.csv

In [93]: # Credit card data
df_cards = spark.read.csv(caminho + 'cards_data.csv', header=True, inferSchema=True)

# Display the dataframe schema
df_cards.printSchema()

root
|-- id: integer (nullable = true)
|-- client_id: integer (nullable = true)
|-- card_brand: string (nullable = true)
|-- card_type: string (nullable = true)
|-- card_number: long (nullable = true)
|-- expires: string (nullable = true)
|-- cvv: integer (nullable = true)
|-- has_chip: string (nullable = true)
|-- num_cards_issued: integer (nullable = true)
|-- credit_limit: string (nullable = true)
|-- acct_open_date: string (nullable = true)
|-- year_pin_last_changed: integer (nullable = true)
|-- card_on_dark_web: string (nullable = true)

transactions_data.csv

In [94]: # Transaction data
df_transactions = spark.read.csv(caminho + 'transactions_data.csv', header=True, inferSchema=True)

# Display the dataframe schema
df_transactions.printSchema()
root
|-- id: integer (nullable = true)
|-- date: timestamp (nullable = true)
|-- client_id: integer (nullable = true)
|-- card_id: integer (nullable = true)
|-- amount: string (nullable = true)
|-- use_chip: string (nullable = true)
|-- merchant_id: integer (nullable = true)
|-- merchant_city: string (nullable = true)
|-- merchant_state: string (nullable = true)
|-- zip: double (nullable = true)
|-- mcc: integer (nullable = true)
|-- errors: string (nullable = true)

users_data.csv

In [95]: # Client data
df_clients = spark.read.csv(caminho + 'users_data.csv', header=True, inferSchema=True)

# Display the dataframe schema
df_clients.printSchema()

root
|-- id: integer (nullable = true)
|-- current_age: integer (nullable = true)
|-- retirement_age: integer (nullable = true)
|-- birth_year: integer (nullable = true)
|-- birth_month: integer (nullable = true)
|-- gender: string (nullable = true)
|-- address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- per_capita_income: string (nullable = true)
|-- yearly_income: string (nullable = true)
|-- total_debt: string (nullable = true)
|-- credit_score: integer (nullable = true)
|-- num_credit_cards: integer (nullable = true)

3.2. JSON files

In [96]: def process_json_to_csv(json_path, csv_path, list_columns, one_column=None):
    '''
    Process a JSON file and convert it to CSV.

    :param json_path: path + filename.json
        Path of the input JSON file.
    :param csv_path: path + filename.csv
        Path of the output CSV file.
    :param list_columns: list
        List with the CSV column names.
    :param one_column: string (default=None)
        Name of the JSON key used to access nested data (optional).
    '''
    # Load the JSON file
    with open(json_path, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)

    if not data:
        print('Arquivo JSON está vazio.')  # "JSON file is empty."
        return

    # Process nested data, if needed
    if one_column:
        data = data.get(one_column, {})
        if not data:
            # "Key not found or empty in the JSON."
            print(f'A chave \'{one_column}\' não foi encontrada ou está vazia no JSON.')
            return
        print(f'Dados extraídos da chave \'{one_column}\'.')  # "Data extracted from key."
    else:
        print('Processando JSON sem dados aninhados.')  # "Processing JSON without nested data."

    # Write the CSV
    with open(csv_path, 'w', encoding='utf-8', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(list_columns)  # Write the header

        for key, value in data.items():
            writer.writerow([key, value])  # Write one row per key/value pair

    print(f'Arquivo CSV gerado com sucesso em {csv_path}.')  # "CSV file written successfully."

mcc_codes.json

In [97]: # Process mcc_codes.json
json_path = caminho + 'mcc_codes.json'
csv_path = caminho + 'mcc_codes.csv'
list_columns = ['code', 'description']
process_json_to_csv(json_path, csv_path, list_columns)

Processando JSON sem dados aninhados.
Arquivo CSV gerado com sucesso em dados/mcc_codes.csv.

In [98]: # Merchant category code (MCC) data
df_mcc = spark.read.csv(caminho + 'mcc_codes.csv', header=True, inferSchema=True)

# Display the dataframe schema
df_mcc.printSchema()

root
|-- code: integer (nullable = true)
|-- description: string (nullable = true)

train_fraud_labels.json

In [99]: # Process train_fraud_labels.json
json_path = caminho + 'train_fraud_labels.json'
csv_path = caminho + 'train_fraud_labels.csv'
list_columns = ['transaction_id', 'is_fraud']
one_column = 'target'
process_json_to_csv(json_path, csv_path, list_columns, one_column)

Dados extraídos da chave 'target'.
Arquivo CSV gerado com sucesso em dados/train_fraud_labels.csv.
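To make the `one_column='target'` case concrete, the sketch below mimics the nested layout with a tiny in-memory stand-in and flattens it into the same two-column CSV shape. The sample dictionary and the helper name `flatten_labels` are hypothetical; only the `target` key and the column names come from the cells above.

```python
import csv
import io
import json

# Hypothetical miniature of train_fraud_labels.json: a single top-level
# key ('target') that maps transaction ids to 'Yes'/'No' labels.
sample_json = json.dumps({'target': {'10649266': 'No', '23410063': 'No'}})


def flatten_labels(raw_json, key, header):
    '''Return CSV text with one row per (id, label) pair found under `key`.'''
    nested = json.loads(raw_json).get(key, {})
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    for transaction_id, label in nested.items():
        writer.writerow([transaction_id, label])
    return buffer.getvalue()


csv_text = flatten_labels(sample_json, 'target', ['transaction_id', 'is_fraud'])
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of file paths; the notebook's function writes the same rows to disk instead.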

In [100]: # Fraud label data
df_train_fraud = spark.read.csv(caminho + 'train_fraud_labels.csv', header=True, inferSchema=True)

# Display the dataframe schema
df_train_fraud.printSchema()

root
|-- transaction_id: integer (nullable = true)
|-- is_fraud: string (nullable = true)

4. Data analysis

Dataframe volumetry

In [101]: def dataframe_volumetry(df, df_name):
    '''
    Print the number of rows and columns of a dataframe.

    :param df: DataFrame
        The dataframe to analyze.
    :param df_name: string
        Name of the dataframe to analyze.
    '''
    # Output labels are in Portuguese: Volumetria = volumetry, Linhas = rows, Colunas = columns.
    # \033[3m ... \033[0m renders the name in italics on ANSI terminals.
    print(f'Volumetria do \033[3m{df_name}\033[0m')
    print(f'Linhas: {df.count()}')
    print(f'Colunas: {len(df.columns)}\n')

In [102]: dataframe_volumetry(df_cards, 'df_cards')
dataframe_volumetry(df_transactions, 'df_transactions')
dataframe_volumetry(df_clients, 'df_clients')
dataframe_volumetry(df_mcc, 'df_mcc')
dataframe_volumetry(df_train_fraud, 'df_train_fraud')

Volumetria do df_cards
Linhas: 6146
Colunas: 13

Volumetria do df_transactions
Linhas: 13305915
Colunas: 12

Volumetria do df_clients
Linhas: 2000
Colunas: 14

Volumetria do df_mcc
Linhas: 109
Colunas: 2

Volumetria do df_train_fraud
Linhas: 8914963
Colunas: 2

Creating temporary views for each dataframe

In [103]: df_cards.createOrReplaceTempView('tb_cards')
df_clients.createOrReplaceTempView('tb_clients')
df_mcc.createOrReplaceTempView('tb_mcc')
df_train_fraud.createOrReplaceTempView('tb_train_fraud')
df_transactions.createOrReplaceTempView('tb_transactions')

Viewing the tb_cards data

In [104]: df_cards.show(10, truncate=False)

+----+---------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
|id |client_id|card_brand|card_type |card_number |expires|cvv|has_chip|num_cards_issued|credit_limit|acct_open_date|year_pin_last_changed|card_on_dark_web|
+----+---------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
|4524|825 |Visa |Debit |4344676511950444|12/2022|623|YES |2 |$24295 |09/2002 |2008 |No |
|2731|825 |Visa |Debit |4956965974959986|12/2020|393|YES |2 |$21968 |04/2014 |2014 |No |
|3701|825 |Visa |Debit |4582313478255491|02/2024|719|YES |2 |$46414 |07/2003 |2004 |No |
|42 |825 |Visa |Credit |4879494103069057|08/2024|693|NO |1 |$12400 |01/2003 |2012 |No |
|4659|825 |Mastercard|Debit (Prepaid)|5722874738736011|03/2009|75 |YES |1 |$28 |09/2008 |2009 |No |
|4537|1746 |Visa |Credit |4404898874682993|09/2003|736|YES |1 |$27500 |09/2003 |2012 |No |
|1278|1746 |Visa |Debit |4001482973848631|07/2022|972|YES |2 |$28508 |02/2011 |2011 |No |
|3687|1746 |Mastercard|Debit |5627220683410948|06/2022|48 |YES |2 |$9022 |07/2003 |2015 |No |
|3465|1746 |Mastercard|Debit (Prepaid)|5711382187309326|11/2020|722|YES |2 |$54 |06/2010 |2015 |No |
|3754|1746 |Mastercard|Debit (Prepaid)|5766121508358701|02/2023|908|YES |1 |$99 |07/2006 |2012 |No |
+----+---------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+
only showing top 10 rows

Viewing the tb_clients data

In [105]: df_clients.show(10, truncate=False)

+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
|id |current_age|retirement_age|birth_year|birth_month|gender|address |latitude|longitude|per_capita_income|yearly_income|total_debt|credit_score|num_credit_cards|
+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
|825 |53 |66 |1966 |11 |Female|462 Rose Lane |34.15 |-117.76 |$29278 |$59696 |$127613 |787 |5 |
|1746|53 |68 |1966 |12 |Female|3606 Federal Boulevard |40.76 |-73.74 |$37891 |$77254 |$191349 |701 |5 |
|1718|81 |67 |1938 |11 |Female|766 Third Drive |34.02 |-117.89 |$22681 |$33483 |$196 |698 |5 |
|708 |63 |63 |1957 |1 |Female|3 Madison Street |40.71 |-73.99 |$163145 |$249925 |$202328 |722 |4 |
|1164|43 |70 |1976 |9 |Male |9620 Valley Stream Drive|37.76 |-122.44 |$53797 |$109687 |$183855 |675 |1 |
|68 |42 |70 |1977 |10 |Male |58 Birch Lane |41.55 |-90.6 |$20599 |$41997 |$0 |704 |3 |
|1075|36 |67 |1983 |12 |Female|5695 Fifth Street |38.22 |-85.74 |$25258 |$51500 |$102286 |672 |3 |
|1711|26 |67 |1993 |12 |Male |1941 Ninth Street |45.51 |-122.64 |$26790 |$54623 |$114711 |728 |1 |
|1116|81 |66 |1938 |7 |Female|11 Spruce Avenue |40.32 |-75.32 |$26273 |$42509 |$2895 |755 |5 |
|1752|34 |60 |1986 |1 |Female|887 Grant Street |29.97 |-92.12 |$18730 |$38190 |$81262 |810 |1 |
+----+-----------+--------------+----------+-----------+------+------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+
only showing top 10 rows

Viewing the tb_mcc data

In [106]: df_mcc.show(10, truncate=False)

+----+------------------------------------------+
|code|description |
+----+------------------------------------------+
|5812|Eating Places and Restaurants |
|5541|Service Stations |
|7996|Amusement Parks, Carnivals, Circuses |
|5411|Grocery Stores, Supermarkets |
|4784|Tolls and Bridge Fees |
|4900|Utilities - Electric, Gas, Water, Sanitary|
|5942|Book Stores |
|5814|Fast Food Restaurants |
|4829|Money Transfer |
|5311|Department Stores |
+----+------------------------------------------+
only showing top 10 rows

Viewing the tb_train_fraud data

In [107]: df_train_fraud.show(10, truncate=False)

+--------------+--------+
|transaction_id|is_fraud|
+--------------+--------+
|10649266 |No |
|23410063 |No |
|9316588 |No |
|12478022 |No |
|9558530 |No |
|12532830 |No |
|19526714 |No |
|9906964 |No |
|13224888 |No |
|13749094 |No |
+--------------+--------+
only showing top 10 rows

Viewing the tb_transactions data

In [108]: df_transactions.show(10, truncate=False)

+-------+-------------------+---------+-------+-------+------------------+-----------+-------------+--------------+-------+----+------+
|id |date |client_id|card_id|amount |use_chip |merchant_id|merchant_city|merchant_state|zip |mcc |errors|
+-------+-------------------+---------+-------+-------+------------------+-----------+-------------+--------------+-------+----+------+
|7475327|2010-01-01 00:01:00|1556 |2972 |$-77.00|Swipe Transaction |59935 |Beulah |ND |58523.0|5499|NULL |
|7475328|2010-01-01 00:02:00|561 |4575 |$14.57 |Swipe Transaction |67570 |Bettendorf |IA |52722.0|5311|NULL |
|7475329|2010-01-01 00:02:00|1129 |102 |$80.00 |Swipe Transaction |27092 |Vista |CA |92084.0|4829|NULL |
|7475331|2010-01-01 00:05:00|430 |2860 |$200.00|Swipe Transaction |27092 |Crown Point |IN |46307.0|4829|NULL |
|7475332|2010-01-01 00:06:00|848 |3915 |$46.41 |Swipe Transaction |13051 |Harwood |MD |20776.0|5813|NULL |
|7475333|2010-01-01 00:07:00|1807 |165 |$4.81 |Swipe Transaction |20519 |Bronx |NY |10464.0|5942|NULL |
|7475334|2010-01-01 00:09:00|1556 |2972 |$77.00 |Swipe Transaction |59935 |Beulah |ND |58523.0|5499|NULL |
|7475335|2010-01-01 00:14:00|1684 |2140 |$26.46 |Online Transaction|39021 |ONLINE |NULL |NULL |4784|NULL |
|7475336|2010-01-01 00:21:00|335 |5131 |$261.58|Online Transaction|50292 |ONLINE |NULL |NULL |7801|NULL |
|7475337|2010-01-01 00:21:00|351 |1112 |$10.74 |Swipe Transaction |3864 |Flushing |NY |11355.0|5813|NULL |
+-------+-------------------+---------+-------+-------+------------------+-----------+-------------+--------------+-------+----+------+
only showing top 10 rows
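
The `amount` column is read as a string because every value carries a `$` prefix, and some values are negative (for example `$-77.00`). The SQL cells further on clean it with `CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))`. A minimal plain-Python sketch of the same cleanup follows; the helper name `parse_amount` is hypothetical and the code is an illustration, not part of the notebook.

```python
from decimal import Decimal


def parse_amount(text):
    '''Convert a string such as '$-77.00' into a Decimal with two places.'''
    return Decimal(text.replace('$', '')).quantize(Decimal('0.01'))


# Values copied from the first rows of tb_transactions shown above.
amounts = [parse_amount(value) for value in ['$-77.00', '$14.57', '$80.00']]
print(amounts, sum(amounts))
```

`Decimal` is used instead of `float` so that sums of monetary values stay exact, which matches the intent of the `DECIMAL(10,2)` cast.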

5. Building the ABT

Renaming the identifier columns of the dataframes

In [109]: df_cards = df_cards.withColumnRenamed('id', 'id_card')
df_cards = df_cards.withColumnRenamed('client_id', 'client_id_card')
df_clients = df_clients.withColumnRenamed('id', 'id_client')

Joining the dataframes

In [110]: abt_01 = df_transactions \
    .join(df_cards, df_transactions['card_id'] == df_cards['id_card'], how='inner') \
    .join(df_clients, df_transactions['client_id'] == df_clients['id_client'], how='inner') \
    .join(df_mcc, df_transactions['mcc'] == df_mcc['code'], how='inner') \
    .join(df_train_fraud, df_transactions['id'] == df_train_fraud['transaction_id'], how='left')
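
The last join is a left join, so transactions without a label are kept with NULL in `transaction_id` and `is_fraud` rather than dropped. The plain-Python stand-in below (not the Spark API) reproduces that behaviour for two transaction ids that appear in this notebook's output; the list and dictionary layout is an illustrative assumption.

```python
# Two transaction ids and the single label that exists for them
# (in the abt_01 sample, 7475336 has no label and 7475338 is 'No').
transaction_ids = [7475336, 7475338]
labels = {7475338: 'No'}

# Left join: every transaction survives; a missing label becomes None.
left_joined = [(tid, labels.get(tid)) for tid in transaction_ids]

# Inner join: only transactions that have a label survive.
inner_joined = [(tid, labels[tid]) for tid in transaction_ids if tid in labels]

print(left_joined)
print(inner_joined)
```

Using an inner join for the labels would silently shrink the ABT to labeled transactions only, which may or may not be what a later modelling step wants.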

In [111]: abt_01.show(20, truncate=False)

+--------+-------------------+---------+-------+-------+------------------+-----------+----------------+--------------+-------+----+-------+-------+--------------+----------+---------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+--------------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+----------------------------------+--------------+--------+
|id |date |client_id|card_id|amount |use_chip |merchant_id|merchant_city |merchant_state|zip |mcc |errors |id_card|client_id_card|card_brand|card_type|card_number |expires|cvv|has_chip|num_cards_issued|credit_limit|acct_open_date|year_pin_last_changed|card_on_dark_web|id_client|current_age|retirement_age|birth_year|birth_month|gender|address |latitude|longitude|per_capita_income|yearly_income|total_debt|credit_score|num_credit_cards|code|description |transaction_id|is_fraud|
+--------+-------------------+---------+-------+-------+------------------+-----------+----------------+--------------+-------+----+-------+-------+--------------+----------+---------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+--------------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+----------------------------------+--------------+--------+
|7475336 |2010-01-01 00:21:00|335 |5131 |$261.58|Online Transaction|50292 |ONLINE |NULL |NULL |7801|NULL |5131 |335 |Visa |Debit |4414800408438414|06/2020|833|YES |1 |$23401 |10/2008 |2011 |No |335 |46 |68 |1973 |7 |Female|75 Birch Lane |26.74 |-80.12 |$27696 |$56467 |$66565 |688 |3 |7801|Athletic Fields, Commercial Sports|NULL |NULL |
|7475338 |2010-01-01 00:23:00|554 |3912 |$3.51 |Swipe Transaction |67570 |Pearland |TX |77581.0|5311|NULL |3912 |554 |Visa |Debit |4096589319918041|04/2021|856|NO |1 |$25658 |07/2009 |2009 |No |554 |59 |67 |1960 |8 |Male |6310 Sixth Street |29.66 |-95.04 |$26170 |$53357 |$114266 |690 |5 |5311|Department Stores |7475338 |No |
|7475341 |2010-01-01 00:27:00|1797 |1127 |$43.33 |Swipe Transaction |33326 |Kahului |HI |96732.0|4121|NULL |1127 |1797 |Visa |Debit |4777281869545650|02/2017|256|YES |2 |$23237 |02/2007 |2009 |No |1797 |67 |65 |1952 |11 |Male |391 Martin Luther King Boulevard|37.71 |-122.16 |$24971 |$30962 |$15336 |743 |5 |4121|Taxicabs and Limousines |7475341 |No |
|9190163 |2011-02-22 21:42:00|1482 |1254 |$127.86|Swipe Transaction |2932 |Alpharetta |GA |30005.0|7922|NULL |1254 |1482 |Visa |Debit |4474303260979464|10/2021|877|YES |2 |$34877 |02/2010 |2012 |No |1482 |54 |64 |1965 |8 |Male |52 Sixth Boulevard |34.06 |-84.27 |$43725 |$89152 |$162012 |675 |1 |7922|Theatrical Producers |9190163 |No |
|9190168 |2011-02-22 21:47:00|1340 |2954 |$147.12|Swipe Transaction |61195 |Houston |TX |77064.0|5541|NULL |2954 |1340 |Amex |Credit |329808568475348 |06/2021|259|NO |1 |$13300 |05/2008 |2013 |No |1340 |53 |68 |1966 |8 |Male |815 Hillside Drive |29.76 |-95.38 |$26420 |$53872 |$64269 |691 |5 |5541|Service Stations |9190168 |No |
|9190178 |2011-02-22 21:54:00|1969 |4231 |$58.70 |Online Transaction|88459 |ONLINE |NULL |NULL |5311|NULL |4231 |1969 |Visa |Debit |4575344947327711|12/2016|350|NO |2 |$28759 |08/2007 |2014 |No |1969 |84 |65 |1935 |4 |Male |402 El Camino Drive |27.48 |-82.57 |$29583 |$44690 |$2249 |733 |5 |5311|Department Stores |9190178 |No |
|10916272|2012-03-28 11:13:00|380 |5838 |$12.43 |Swipe Transaction |75781 |Des Moines |IA |50317.0|5411|NULL |5838 |380 |Mastercard|Credit |5764817007089895|07/2023|815|YES |2 |$10800 |12/2005 |2010 |No |380 |46 |70 |1974 |1 |Male |3551 Spruce Boulevard |41.57 |-93.61 |$18568 |$37857 |$57727 |800 |2 |5411|Grocery Stores, Supermarkets |NULL |NULL |
|12635225|2013-04-19 18:12:00|1451 |3226 |$52.04 |Swipe Transaction |59935 |Revere |MA |2151.0 |5499|NULL |3226 |1451 |Mastercard|Debit |5004479479624202|06/2024|206|YES |1 |$34319 |06/2001 |2013 |No |1451 |58 |57 |1961 |4 |Female|9384 Lake Street |42.41 |-70.99 |$20979 |$17078 |$25245 |719 |4 |5499|Miscellaneous Food Stores |12635225 |No |
|12635227|2013-04-19 18:13:00|197 |1302 |$45.15 |Swipe Transaction |26426 |Wilmore |KY |40390.0|5812|Bad PIN|1302 |197 |Mastercard|Debit |5237861402927606|01/2022|536|YES |1 |$17889 |02/2012 |2012 |No |197 |32 |60 |1987 |5 |Male |1867 Bayview Street |37.86 |-84.65 |$15757 |$32128 |$58373 |724 |5 |5812|Eating Places and Restaurants |12635227 |No |
|12635235|2013-04-19 18:14:00|655 |4134 |$20.00 |Swipe Transaction |27092 |Madison |OH |44057.0|4829|NULL |4134 |655 |Mastercard|Debit |5775179305795307|12/2022|857|YES |2 |$11638 |08/2004 |2011 |No |655 |49 |66 |1970 |3 |Female|211 Lexington Drive |41.44 |-82.18 |$14880 |$30340 |$64661 |552 |3 |4829|Money Transfer |12635235 |No |
|14358048|2014-05-03 11:57:00|848 |3915 |$19.07 |Swipe Transaction |31250 |Harwood |MD |20776.0|7210|NULL |3915 |848 |Visa |Debit |4354185735186651|01/2020|120|YES |1 |$19113 |07/2009 |2014 |No |848 |51 |69 |1968 |5 |Male |166 River Drive |38.86 |-76.6 |$33529 |$68362 |$96182 |711 |2 |7210|Laundry Services |14358048 |No |
|14358065|2014-05-03 12:02:00|605 |212 |$1.87 |Swipe Transaction |74005 |Brooklyn |NY |11210.0|5812|NULL |212 |605 |Visa |Credit |4542762527059522|12/2022|70 |YES |1 |$13500 |01/2009 |2009 |No |605 |42 |73 |1977 |10 |Male |663 Summit Boulevard |40.64 |-73.94 |$23316 |$47542 |$2667 |725 |3 |5812|Eating Places and Restaurants |14358065 |No |
|14358070|2014-05-03 12:03:00|1241 |3501 |$98.06 |Swipe Transaction |86410 |Saint Henry |OH |45883.0|5211|NULL |3501 |1241 |Visa |Debit |4617983811742101|03/2023|469|YES |2 |$9335 |06/2010 |2010 |No |1241 |40 |69 |1979 |3 |Female|159 Plum Avenue |40.43 |-84.38 |$24227 |$49396 |$39549 |814 |3 |5211|Lumber and Building Materials |14358070 |No |
|16093582|2015-05-15 13:18:00|1638 |4296 |$14.98 |Online Transaction|16798 |ONLINE |NULL |NULL |4121|NULL |4296 |1638 |Mastercard|Debit |5445308987864903|10/2021|190|YES |1 |$18281 |08/2009 |2009 |No |1638 |79 |64 |1940 |8 |Female|6337 Spruce Street |44.67 |-93.24 |$34186 |$57824 |$31354 |668 |5 |4121|Taxicabs and Limousines |16093582 |No |
|16093586|2015-05-15 13:19:00|571 |5960 |$57.32 |Chip Transaction |32175 |Martinsburg |WV |25404.0|7538|NULL |5960 |571 |Mastercard|Debit |5060522440059037|02/2017|556|YES |1 |$12247 |12/2009 |2009 |No |571 |42 |68 |1977 |11 |Male |8580 Valley Stream Avenue |39.46 |-77.96 |$18003 |$36711 |$84949 |739 |4 |7538|Automotive Service Shops |16093586 |No |
|16093589|2015-05-15 13:19:00|1963 |3317 |$26.81 |Swipe Transaction |88852 |Vacaville |CA |95687.0|4121|NULL |3317 |1963 |Mastercard|Debit |5735308019363324|02/2024|502|YES |1 |$25297 |06/2006 |2006 |No |1963 |98 |69 |1921 |11 |Male |468 Spruce Street |38.35 |-121.93 |$26137 |$33869 |$370 |821 |7 |4121|Taxicabs and Limousines |NULL |NULL |
|16093595|2015-05-15 13:21:00|1107 |8 |$135.21|Swipe Transaction |16041 |South Ozone Park|NY |11420.0|7802|NULL |8 |1107 |Mastercard|Credit |5462760953855576|09/2021|665|NO |2 |$10300 |01/1998 |2006 |No |1107 |71 |69 |1949 |1 |Female|836 Summit Boulevard |42.23 |-76.34 |$18323 |$40516 |$6908 |698 |5 |7802|Recreational Sports, Clubs |16093595 |No |
|16093599|2015-05-15 13:22:00|419 |3587 |$24.86 |Chip Transaction |99483 |Murphys |CA |95247.0|5912|NULL |3587 |419 |Visa |Debit |4298987544338030|03/2020|757|YES |1 |$16731 |06/2014 |2014 |No |419 |60 |66 |1959 |11 |Male |933 Valley Lane |38.14 |-120.45 |$16448 |$33540 |$74530 |734 |5 |5912|Drug Stores and Pharmacies |16093599 |No |
|17838219|2016-05-23 13:02:00|185 |2711 |$67.48 |Online Transaction|73186 |ONLINE |NULL |NULL |4814|NULL |2711 |185 |Visa |Credit |4718517475996018|01/2021|492|YES |2 |$5700 |04/2012 |2012 |No |185 |47 |67 |1973 |1 |Female|276 Fifth Boulevard |40.66 |-74.19 |$15175 |$30942 |$71066 |779 |3 |4814|Telecommunication Services |17838219 |No |
|17838233|2016-05-23 13:04:00|1648 |5282 |$-85.00|Chip Transaction |43293 |Morris Plains |NJ |7950.0 |5499|NULL |5282 |1648 |Visa |Credit |4699987716360968|06/2022|973|YES |1 |$36700 |10/2013 |2013 |No |1648 |66 |69 |1953 |5 |Female|211 Valley Street |40.76 |-74.59 |$91180 |$185909 |$461854 |621 |5 |5499|Miscellaneous Food Stores |17838233 |No |
+--------+-------------------+---------+-------+-------+------------------+-----------+----------------+--------------+-------+----+-------+-------+--------------+----------+---------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+--------------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+----------------------------------+--------------+--------+
only showing top 20 rows

Creating a temporary view for the ABT

In [112]: abt_01.createOrReplaceTempView('tb_abt_01')

In [113]: dataframe_volumetry(abt_01, 'tb_abt_01')

Volumetria do tb_abt_01
Linhas: 13305915
Colunas: 43
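
The row count equals that of `df_transactions` (13,305,915), which suggests the three inner joins dropped no transactions. Comparing it with the 8,914,963 rows of `df_train_fraud` gives a rough idea of how many ABT rows lack a label. The sketch below assumes every label row matches exactly one transaction, which the notebook does not verify.

```python
abt_rows = 13_305_915    # rows in tb_abt_01 (same count as df_transactions)
label_rows = 8_914_963   # rows in df_train_fraud

# Assumption: each label row joins to exactly one transaction.
unlabeled_rows = abt_rows - label_rows
labeled_share = label_rows / abt_rows

print(f'Unlabeled rows: {unlabeled_rows}')
print(f'Labeled share: {labeled_share:.2%}')
```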

5.1. ABT data analysis

General overview of the data

In [114]: query_01 = spark.sql('''
    SELECT
        YEAR(TO_DATE(date, 'yyyy-MM-dd')) AS ANO,
        COUNT(id) AS QTDE_TRANSACOES,
        COUNT(DISTINCT client_id) AS QTDE_CLIENTES,
        SUM(CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))) AS TOTAL_TRANSACOES,
        COUNT(DISTINCT card_id) AS QTDE_CARTOES,
        COUNT(DISTINCT merchant_id) AS QTDE_COMERCIANTES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM
        tb_abt_01
    GROUP BY
        ANO
    ORDER BY
        ANO;
''').show(20, truncate=False)
+----+---------------+-------------+----------------+------------+-----------------+------------+-----------+
|ANO |QTDE_TRANSACOES|QTDE_CLIENTES|TOTAL_TRANSACOES|QTDE_CARTOES|QTDE_COMERCIANTES|QTDE_FRAUDES|TAXA_FRAUDE|
+----+---------------+-------------+----------------+------------+-----------------+------------+-----------+
|2010|1240880 |1137 |54232556.12 |2896 |27901 |2573 |0.207 |
|2011|1290770 |1167 |55778904.96 |3137 |28153 |37 |0.003 |
|2012|1321672 |1177 |56832410.86 |3245 |28710 |923 |0.07 |
|2013|1352808 |1190 |58284939.62 |3339 |29280 |1337 |0.099 |
|2014|1365537 |1195 |58617820.51 |3418 |28987 |664 |0.049 |
|2015|1388065 |1204 |59514007.43 |3473 |29616 |2189 |0.158 |
|2016|1392117 |1210 |59844028.90 |3497 |29600 |2448 |0.176 |
|2017|1399308 |1209 |59628480.63 |3492 |29781 |172 |0.012 |
|2018|1394792 |1208 |59627317.94 |3488 |29435 |1629 |0.117 |
|2019|1159966 |1206 |49475055.31 |3437 |27516 |1360 |0.117 |
+----+---------------+-------------+----------------+------------+-----------------+------------+-----------+

Analysis of transactions by card

In [115]: query_02 = spark.sql('''
    SELECT
        has_chip AS CONTEM_CHIP,
        use_chip AS METODO_AUTENTICACAO,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM
        tb_abt_01
    GROUP BY
        has_chip, use_chip
    ORDER BY
        QTDE_TRANSACOES DESC;
''').show(20, truncate=False)

+-----------+-------------------+---------------+------------+-----------+
|CONTEM_CHIP|METODO_AUTENTICACAO|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+-----------+-------------------+---------------+------------+-----------+
|YES |Swipe Transaction |5774805 |801 |0.014 |
|YES |Chip Transaction |4780818 |3176 |0.066 |
|YES |Online Transaction |1419172 |8155 |0.575 |
|NO |Swipe Transaction |1192380 |576 |0.048 |
|NO |Online Transaction |138740 |624 |0.45 |
+-----------+-------------------+---------------+------------+-----------+
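
`TAXA_FRAUDE` is a percentage: 100 times the fraud count divided by the transaction count of the group. As a quick consistency check, the rates can be recomputed from the two count columns printed above. Python's `round` stands in for SQL `ROUND` here, which is an approximation because the two can differ on exact ties; the helper name `fraud_rate` is hypothetical.

```python
# (has_chip, use_chip, transactions, frauds) copied from the output above.
rows = [
    ('YES', 'Swipe Transaction', 5774805, 801),
    ('YES', 'Chip Transaction', 4780818, 3176),
    ('YES', 'Online Transaction', 1419172, 8155),
    ('NO', 'Swipe Transaction', 1192380, 576),
    ('NO', 'Online Transaction', 138740, 624),
]


def fraud_rate(frauds, transactions):
    '''Fraud rate as a percentage, rounded to three decimal places.'''
    return round(100 * frauds / transactions, 3)


rates = [fraud_rate(frauds, total) for _, _, total, frauds in rows]
print(rates)
```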

Transactions by value range

In [116]: query_03 = spark.sql('''
    SELECT
        CASE
            WHEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) BETWEEN 0 AND 50 THEN '0-50'
            WHEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) BETWEEN 51 AND 100 THEN '51-100'
            WHEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) BETWEEN 101 AND 500 THEN '101-500'
            WHEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) BETWEEN 501 AND 1000 THEN '501-1000'
            ELSE '1000+'
        END AS value_range,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM
        tb_abt_01
    GROUP BY
        value_range
    ORDER BY
        QTDE_TRANSACOES DESC;
''').show(20, truncate=False)

+-----------+---------------+------------+-----------+
|value_range|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+-----------+---------------+------------+-----------+
|0-50 |8198871 |4943 |0.06 |
|51-100 |2939552 |2761 |0.094 |
|101-500 |1366430 |4690 |0.343 |
|1000+ |767187 |687 |0.09 |
|501-1000 |33875 |251 |0.741 |
+-----------+---------------+------------+-----------+
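
The `BETWEEN` bounds in the query are whole numbers, so a decimal amount such as 50.50 matches no branch and lands in `ELSE '1000+'`, as does any negative amount. That may explain why `1000+` holds far more rows than `501-1000`, though the notebook does not confirm it. The plain-Python mirror below is a sketch for illustration; the function name `value_range` is hypothetical.

```python
from decimal import Decimal


def value_range(amount):
    '''Mirror of the CASE expression in query_03 (BETWEEN is inclusive).'''
    if 0 <= amount <= 50:
        return '0-50'
    if 51 <= amount <= 100:
        return '51-100'
    if 101 <= amount <= 500:
        return '101-500'
    if 501 <= amount <= 1000:
        return '501-1000'
    return '1000+'  # also catches negatives and the gaps between ranges


for sample in ['14.57', '50.50', '-77.00', '1200.00']:
    print(sample, value_range(Decimal(sample)))
```

If contiguous ranges are wanted, half-open comparisons (for example `amount > 50 AND amount <= 100`) and an explicit branch for negative amounts would close the gaps; this is a suggestion, not something the notebook does.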

Analysis of fraudulent transactions and amounts

In [117]: query_04 = spark.sql('''
    SELECT
        YEAR(TO_DATE(date, 'yyyy-MM-dd')) AS ANO,
        COUNT(*) AS TOTAL_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) / COUNT(*), 2) AS TAXA_FRAUDE,
        AVG(CASE WHEN is_fraud = 'Yes' AND CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) >= 0
                 THEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))
            END) AS MEDIA_VALORES_FRAUDES,
        SUM(CASE WHEN is_fraud = 'Yes' AND CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) >= 0
                 THEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))
            END) AS TOTAL_VALORES_FRAUDES,
        MAX(CASE WHEN is_fraud = 'Yes' THEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))
            END) AS MAIOR_VALOR_FRAUDADO,
        MIN(CASE WHEN is_fraud = 'Yes' AND CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2)) >= 0
                 THEN CAST(REPLACE(amount, '$', '') AS DECIMAL(10,2))
            END) AS MENOR_VALOR_FRAUDADO
    FROM
        tb_abt_01
    GROUP BY
        ANO
    ORDER BY
        ANO;
''').show(20, truncate=False)
+----+----------------+------------+-----------+---------------------+---------------------+--------------------+--------------------+
|ANO |TOTAL_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|MEDIA_VALORES_FRAUDES|TOTAL_VALORES_FRAUDES|MAIOR_VALOR_FRAUDADO|MENOR_VALOR_FRAUDADO|
+----+----------------+------------+-----------+---------------------+---------------------+--------------------+--------------------+
|2010|1240880 |2573 |0.21 |141.412778 |345612.83 |4978.45 |0.00 |
|2011|1290770 |37 |0.0 |170.364706 |5792.40 |540.30 |0.22 |
|2012|1321672 |923 |0.07 |90.272121 |80432.46 |1073.75 |0.01 |
|2013|1352808 |1337 |0.1 |135.406258 |172236.76 |2263.61 |0.01 |
|2014|1365537 |664 |0.05 |129.520814 |82763.80 |2251.32 |0.02 |
|2015|1388065 |2189 |0.16 |141.414868 |300365.18 |2930.44 |0.03 |
|2016|1392117 |2448 |0.18 |145.727756 |343480.32 |2505.58 |0.03 |
|2017|1399308 |172 |0.01 |90.885731 |15541.46 |1198.72 |0.06 |
|2018|1394792 |1629 |0.12 |89.064544 |140721.98 |1343.48 |0.00 |
|2019|1159966 |1360 |0.12 |92.014729 |122379.59 |1244.41 |0.00 |
+----+----------------+------------+-----------+---------------------+---------------------+--------------------+--------------------+

Distribution of transactions by age group

In [118]: query_05 = spark.sql('''
    SELECT
        CASE
            WHEN current_age BETWEEN 0 AND 10 THEN '0-10'
            WHEN current_age BETWEEN 11 AND 20 THEN '11-20'
            WHEN current_age BETWEEN 21 AND 30 THEN '21-30'
            WHEN current_age BETWEEN 31 AND 40 THEN '31-40'
            WHEN current_age BETWEEN 41 AND 50 THEN '41-50'
            WHEN current_age BETWEEN 51 AND 60 THEN '51-60'
            WHEN current_age BETWEEN 61 AND 70 THEN '61-70'
            WHEN current_age BETWEEN 71 AND 80 THEN '71-80'
            WHEN current_age BETWEEN 81 AND 90 THEN '81-90'
            ELSE '90+'
        END AS FAIXA_ETARIA,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        MIN(current_age) AS min_age
    FROM tb_abt_01
    GROUP BY FAIXA_ETARIA
    ORDER BY min_age;
''').show(20, truncate=False)
+------------+---------------+------------+-------+
|FAIXA_ETARIA|QTDE_TRANSACOES|QTDE_FRAUDES|min_age|
+------------+---------------+------------+-------+
|21-30 |428671 |318 |23 |
|31-40 |2377562 |2041 |31 |
|41-50 |3396784 |3234 |41 |
|51-60 |2943032 |2880 |51 |
|61-70 |2105384 |2479 |61 |
|71-80 |970658 |1136 |71 |
|81-90 |864655 |1077 |81 |
|90+ |219169 |167 |91 |
+------------+---------------+------------+-------+
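The age-group query returns counts only, without a fraud rate per band. As a rough sketch, the rate can be derived from the counts printed above. The names `counts`, `fraud_rate_pct` and `rates` are illustrative and not part of the notebook.

```python
# Counts copied from the output above: age band -> (transactions, frauds).
counts = {
    '21-30': (428671, 318),
    '31-40': (2377562, 2041),
    '41-50': (3396784, 3234),
    '51-60': (2943032, 2880),
    '61-70': (2105384, 2479),
    '71-80': (970658, 1136),
    '81-90': (864655, 1077),
    '90+': (219169, 167),
}


def fraud_rate_pct(transactions, frauds, digits=3):
    """Fraud rate as a percentage of all transactions in the band."""
    return round(100 * frauds / transactions, digits)


rates = {band: fraud_rate_pct(n, f) for band, (n, f) in counts.items()}

for band, rate in rates.items():
    print(f'{band:>5}: {rate:.3f}%')
```

Under this calculation the bands from 61 to 90 show the highest rates, which matches the gender breakdown later in this section.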

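Transactions by merchant state

In [119… query_06 = spark.sql('''
    SELECT
        -- Online transactions have no merchant_state, so NULL is labelled 'ONLINE'
        COALESCE(merchant_state, 'ONLINE') AS UF_COMERCIANTE,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        -- Same fraud condition as QTDE_FRAUDES (the original used is_fraud != 'No',
        -- which counts the same rows because NULL labels never satisfy the comparison)
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN 1 END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM
        tb_abt_01
    GROUP BY
        UF_COMERCIANTE
    ORDER BY
        QTDE_TRANSACOES DESC;
''')
query_06.show(20, truncate=False)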

+--------------+---------------+------------+-----------+
|UF_COMERCIANTE|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+--------------+---------------+------------+-----------+
|ONLINE |1563700 |8779 |0.561 |
|CA |1427087 |127 |0.009 |
|TX |1010207 |76 |0.008 |
|NY |857510 |58 |0.007 |
|FL |701623 |63 |0.009 |
|OH |484122 |316 |0.065 |
|IL |467931 |36 |0.008 |
|NC |429427 |44 |0.01 |
|PA |417766 |33 |0.008 |
|MI |397970 |39 |0.01 |
|GA |368206 |22 |0.006 |
|NJ |322227 |42 |0.013 |
|IN |312470 |26 |0.008 |
|WA |286525 |23 |0.008 |
|TN |284709 |14 |0.005 |
|VA |230685 |26 |0.011 |
|AZ |195940 |11 |0.006 |
|MO |195854 |32 |0.016 |
|MD |193776 |18 |0.009 |
|MN |178808 |14 |0.008 |
+--------------+---------------+------------+-----------+
only showing top 20 rows
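The `is_fraud` column contains NULLs (unlabelled transactions), and in SQL a comparison with NULL evaluates to NULL rather than true or false. A `CASE WHEN` therefore skips those rows whether the condition is written as `is_fraud = 'Yes'` or `is_fraud != 'No'`. The sketch below emulates that three-valued logic in plain Python. The helper names and the `labels` list are illustrative, not taken from the notebook.

```python
def sql_eq(value, literal):
    """Emulate SQL `value = literal`: the result is NULL (None) when value is NULL."""
    return None if value is None else value == literal


def sql_ne(value, literal):
    """Emulate SQL `value != literal`: the result is NULL (None) when value is NULL."""
    return None if value is None else value != literal


def count_when(values, predicate):
    """Emulate COUNT(CASE WHEN predicate THEN 1 END): only strictly true rows count."""
    return sum(1 for v in values if predicate(v) is True)


# Illustrative labels: two frauds, two legitimate rows, two unlabelled rows.
labels = ['Yes', 'No', None, 'No', None, 'Yes']

by_eq = count_when(labels, lambda v: sql_eq(v, 'Yes'))
by_ne = count_when(labels, lambda v: sql_ne(v, 'No'))

print(by_eq, by_ne, len(labels))
```

Both conditions count the same rows, while `COUNT(*)` in the denominator still includes the unlabelled ones, so the reported rate is relative to all transactions and not only to labelled ones.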

Transactions by gender

In [120… query_07 = spark.sql('''
    SELECT
        COALESCE(gender, 'Unknown') AS GENERO,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN 1 END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM
        tb_abt_01
    GROUP BY
        GENERO
    ORDER BY
        QTDE_TRANSACOES DESC;
''')
query_07.show(20, truncate=False)

+------+---------------+------------+-----------+
|GENERO|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+------+---------------+------------+-----------+
|Female|6815916 |6982 |0.102 |
|Male |6489999 |6350 |0.098 |
+------+---------------+------------+-----------+

Transactions by age group and gender

In [121… query_08 = spark.sql('''
    SELECT
        FAIXA_ETARIA,
        gender,
        COUNT(*) AS QTDE_TRANSACOES,
        COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) AS QTDE_FRAUDES,
        ROUND(100 * COUNT(CASE WHEN is_fraud = 'Yes' THEN id END) / COUNT(*), 3) AS TAXA_FRAUDE
    FROM (
        -- Derive the age band once so the outer query can group by it
        SELECT
            CASE
                WHEN current_age BETWEEN 0 AND 10 THEN '0-10'
                WHEN current_age BETWEEN 11 AND 20 THEN '11-20'
                WHEN current_age BETWEEN 21 AND 30 THEN '21-30'
                WHEN current_age BETWEEN 31 AND 40 THEN '31-40'
                WHEN current_age BETWEEN 41 AND 50 THEN '41-50'
                WHEN current_age BETWEEN 51 AND 60 THEN '51-60'
                WHEN current_age BETWEEN 61 AND 70 THEN '61-70'
                WHEN current_age BETWEEN 71 AND 80 THEN '71-80'
                WHEN current_age BETWEEN 81 AND 90 THEN '81-90'
                ELSE '90+'
            END AS FAIXA_ETARIA,
            gender,
            id,
            is_fraud
        FROM tb_abt_01
    ) AS DADOS_FAIXA_ETARIA
    GROUP BY
        FAIXA_ETARIA, gender
    ORDER BY
        QTDE_TRANSACOES DESC;
''')
query_08.show(20, truncate=False)
+------------+------+---------------+------------+-----------+
|FAIXA_ETARIA|gender|QTDE_TRANSACOES|QTDE_FRAUDES|TAXA_FRAUDE|
+------------+------+---------------+------------+-----------+
|41-50 |Female|1764919 |1597 |0.09 |
|41-50 |Male |1631865 |1637 |0.1 |
|51-60 |Male |1512801 |1432 |0.095 |
|51-60 |Female|1430231 |1448 |0.101 |
|31-40 |Male |1243654 |1008 |0.081 |
|31-40 |Female|1133908 |1033 |0.091 |
|61-70 |Female|1062819 |1332 |0.125 |
|61-70 |Male |1042565 |1147 |0.11 |
|81-90 |Female|558634 |715 |0.128 |
|71-80 |Male |487802 |519 |0.106 |
|71-80 |Female|482856 |617 |0.128 |
|81-90 |Male |306021 |362 |0.118 |
|21-30 |Female|256586 |159 |0.062 |
|21-30 |Male |172085 |159 |0.092 |
|90+ |Female|125963 |81 |0.064 |
|90+ |Male |93206 |86 |0.092 |
+------------+------+---------------+------------+-----------+

5.2. ABT target analysis

Data split by target

In [122… query_09 = spark.sql('''
    SELECT
        is_fraud,
        COUNT(*) AS QTDE_TRANSACOES,
        -- Share of each label over all rows of the ABT, including unlabelled (NULL) rows
        ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM tb_abt_01), 3) AS PERCENTUAL
    FROM
        tb_abt_01
    GROUP BY
        is_fraud
    ORDER BY
        QTDE_TRANSACOES DESC;
''')
query_09.show(20, truncate=False)

+--------+---------------+----------+
|is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+--------+---------------+----------+
|No |8901631 |66.9 |
|NULL |4390952 |33.0 |
|Yes |13332 |0.1 |
+--------+---------------+----------+
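About a third of the rows have no label, so the 0.1% fraud share above is relative to all transactions. A quick calculation on the printed counts shows how the rate changes when only labelled rows are used as the denominator. This is a sketch on the numbers above, not a cell from the notebook, and the variable names are illustrative.

```python
# Counts copied from the output above; None stands for the NULL (unlabelled) rows.
counts = {'No': 8901631, None: 4390952, 'Yes': 13332}

total = sum(counts.values())               # every row of the ABT
labelled = counts['Yes'] + counts['No']    # rows with a known target

rate_all = round(100 * counts['Yes'] / total, 3)
rate_labelled = round(100 * counts['Yes'] / labelled, 3)

print(f'Rows: {total}, labelled rows: {labelled}')
print(f'Fraud rate over all rows: {rate_all}%')
print(f'Fraud rate over labelled rows: {rate_labelled}%')
```

Either way the target is heavily imbalanced, which is worth keeping in mind when the ABT is later used for modelling.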

Target distribution by year

In [123… query_10 = spark.sql('''
    SELECT
        YEAR(TO_DATE(date, 'yyyy-MM-dd')) AS ANO,
        is_fraud,
        COUNT(*) AS QTDE_TRANSACOES,
        -- PERCENTUAL is the share over ALL transactions in the table, not over the year:
        -- the original subquery filtered on YEAR(...) = YEAR(...) of the same column,
        -- which is always true, so it is written here as a plain total count
        ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM tb_abt_01), 2) AS PERCENTUAL
    FROM
        tb_abt_01
    GROUP BY
        ANO, is_fraud
    ORDER BY
        ANO, is_fraud DESC;
''')
query_10.show(50, truncate=False)

+----+--------+---------------+----------+
|ANO |is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+----+--------+---------------+----------+
|2010|Yes |2573 |0.02 |
|2010|No |828956 |6.23 |
|2010|NULL |409351 |3.08 |
|2011|Yes |37 |0.0 |
|2011|No |863391 |6.49 |
|2011|NULL |427342 |3.21 |
|2012|Yes |923 |0.01 |
|2012|No |884498 |6.65 |
|2012|NULL |436251 |3.28 |
|2013|Yes |1337 |0.01 |
|2013|No |905967 |6.81 |
|2013|NULL |445504 |3.35 |
|2014|Yes |664 |0.0 |
|2014|No |914409 |6.87 |
|2014|NULL |450464 |3.39 |
|2015|Yes |2189 |0.02 |
|2015|No |928035 |6.97 |
|2015|NULL |457841 |3.44 |
|2016|Yes |2448 |0.02 |
|2016|No |930314 |6.99 |
|2016|NULL |459355 |3.45 |
|2017|Yes |172 |0.0 |
|2017|No |937112 |7.04 |
|2017|NULL |462024 |3.47 |
|2018|Yes |1629 |0.01 |
|2018|No |932970 |7.01 |
|2018|NULL |460193 |3.46 |
|2019|Yes |1360 |0.01 |
|2019|No |775979 |5.83 |
|2019|NULL |382627 |2.88 |
+----+--------+---------------+----------+
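The PERCENTUAL column above is each row's share of the whole table, which is why the three rows of a year add up to roughly 9 to 10% instead of 100%. To read the split within a year, the denominator has to be the year total. In SQL, a window sum partitioned by ANO would be one way to get it. The sketch below does the same arithmetic in plain Python on two of the years printed above, with illustrative variable names.

```python
# (year, is_fraud) -> count, copied from the output above for 2010 and 2019 only.
counts = {
    (2010, 'Yes'): 2573, (2010, 'No'): 828956, (2010, None): 409351,
    (2019, 'Yes'): 1360, (2019, 'No'): 775979, (2019, None): 382627,
}

# Total number of transactions per year (the per-year denominator).
year_totals = {}
for (year, _label), n in counts.items():
    year_totals[year] = year_totals.get(year, 0) + n

# Share of each label within its own year.
per_year_pct = {key: round(100 * n / year_totals[key[0]], 2) for key, n in counts.items()}

for (year, label), pct in per_year_pct.items():
    print(year, label, pct)
```

The 2010 and 2019 totals obtained this way match the TOTAL_TRANSACOES values of the yearly fraud table earlier in this section.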

5.3. Generating a representative sample of the ABT

In [124… # Sample 50% of the data (without replacement, fixed seed for reproducibility)
abt_02 = abt_01.sample(withReplacement=False, fraction=0.50, seed=42)

# Cache the DataFrame so that subsequent actions work on the same sampled rows
abt_02.cache()

Out[124… DataFrame[id: int, date: timestamp, client_id: int, card_id: int, amount: string, use_chip: string, merchant_id: int, merchant_city: string, merchant_state: string, zip: double, mcc: int, errors: string, id_card: int, client_id_card: int, card_brand: string, card_type: string, card_number: bigint, expires: string, cvv: int, has_chip: string, num_cards_issued: int, credit_limit: string, acct_open_date: string, year_pin_last_changed: int, card_on_dark_web: string, id_client: int, current_age: int, retirement_age: int, birth_year: int, birth_month: int, gender: string, address: string, latitude: double, longitude: double, per_capita_income: string, yearly_income: string, total_debt: string, credit_score: int, num_credit_cards: int, code: int, description: string, transaction_id: int, is_fraud: string]
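`DataFrame.sample` with `withReplacement=False` does not guarantee exactly the requested fraction of rows. Conceptually, each row is kept independently with probability `fraction`, which is why the sample in this notebook has 6651886 rows rather than exactly half of the 13305915 rows of the ABT. The sketch below illustrates that idea in plain Python. It is not Spark's implementation, and `bernoulli_sample` is a hypothetical helper.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))

sample_a = bernoulli_sample(rows, 0.5, seed=42)
sample_b = bernoulli_sample(rows, 0.5, seed=42)

# Same seed, same sample; the size is close to, but usually not exactly, 50%.
print(len(sample_a), sample_a == sample_b)
```

Because the sampling is random and not stratified by `is_fraud`, the class proportions of the sample are only approximately those of the full table, so checking them after sampling is a sensible step.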

In [125… abt_02.show(20, truncate=False)


+-------+-------------------+---------+-------+--------+------------------+-----------+----------------+--------------+-------+----+------+-------+--------------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+-----------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+------------------------------------------+--------------+--------+
|id |date |client_id|card_id|amount |use_chip |merchant_id|merchant_city |merchant_state|zip |mcc |errors|id_card|client_id_card|card_brand|card_type |card_number |expires|cvv|has_chip|num_cards_issued|credit_limit|acct_open_date|year_pin_last_changed|card_on_dark_web|id_client|current_age|retirement_age|birth_year|birth_month|gender|address |latitude|longitude|per_capita_income|yearly_income|total_debt|credit_score|num_credit_cards|code|description |transaction_id|is_fraud|
+-------+-------------------+---------+-------+--------+------------------+-----------+----------------+--------------+-------+----+------+-------+--------------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+-----------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+------------------------------------------+--------------+--------+
|7475806|2010-01-01 07:05:00|1840 |4568 |$2.02 |Swipe Transaction |35451 |Beaverton |OR |97005.0|5812|NULL |4568 |1840 |Visa |Debit (Prepaid)|4733359418335581|09/2021|67 |YES |2 |$4 |09/2004 |2008 |No |1840 |46 |71 |1974 |2 |Female|576 Martin Luther King Street|45.49 |-122.8 |$21702 |$44249 |$103229 |706 |5 |5812|Eating Places and Restaurants |7475806 |No |
|7477473|2010-01-01 13:08:00|538 |4161 |$7.48 |Swipe Transaction |26810 |Winterville |NC |28590.0|5541|NULL |4161 |538 |Mastercard|Debit |5885105668024939|12/2014|750|YES |2 |$6993 |08/2005 |2016 |No |538 |66 |69 |1954 |2 |Female|7888 Fourth Street |35.3 |-77.15 |$14844 |$30265 |$36789 |814 |4 |5541|Service Stations |7477473 |No |
|7477784|2010-01-01 14:18:00|724 |2876 |$1.70 |Swipe Transaction |59935 |Cushing |OK |74023.0|5499|NULL |2876 |724 |Mastercard|Debit |5832356224925490|06/2024|245|YES |2 |$16476 |05/2005 |2008 |No |724 |45 |72 |1974 |5 |Female|819 El Camino Boulevard |35.97 |-96.76 |$17237 |$35142 |$107898 |731 |4 |5499|Miscellaneous Food Stores |NULL |NULL |
|7477811|2010-01-01 14:25:00|377 |1175 |$-53.00 |Swipe Transaction |43293 |Withee |WI |54498.0|5499|NULL |1175 |377 |Mastercard|Debit |5009400051376027|11/2023|417|YES |1 |$30403 |02/2009 |2014 |No |377 |80 |67 |1940 |1 |Female|305 Pine Avenue |47.39 |-122.26 |$24884 |$39110 |$363 |750 |5 |5499|Miscellaneous Food Stores |NULL |NULL |
|7478410|2010-01-01 16:55:00|1362 |2145 |$-295.00|Swipe Transaction |96185 |Bladensburg |MD |20710.0|7011|NULL |2145 |1362 |Mastercard|Debit |5566695688917047|03/2017|309|NO |2 |$29708 |03/2007 |2009 |No |1362 |58 |67 |1962 |1 |Male |3385 Hill Lane |38.78 |-77.27 |$35563 |$72510 |$44317 |727 |4 |7011|Lodging - Hotels, Motels, Resorts |NULL |NULL |
|7478830|2010-01-01 19:25:00|1466 |5884 |$17.59 |Online Transaction|16798 |ONLINE |NULL |NULL |4121|NULL |5884 |1466 |Mastercard|Debit |5946854129119703|09/2020|405|YES |1 |$1866 |12/2007 |2014 |No |1466 |36 |75 |1983 |4 |Female|3194 Norfolk Street |38.64 |-75.61 |$17624 |$35933 |$23451 |812 |2 |4121|Taxicabs and Limousines |NULL |NULL |
|7479105|2010-01-01 21:02:00|1693 |5940 |$4.33 |Online Transaction|85247 |ONLINE |NULL |NULL |5815|NULL |5940 |1693 |Mastercard|Debit |5128104617797218|03/2017|726|YES |1 |$33506 |12/2008 |2011 |No |1693 |36 |69 |1983 |4 |Female|478 East Drive |33.61 |-111.89 |$36300 |$74016 |$85204 |702 |2 |5815|Digital Goods - Media, Books, Apps |7479105 |No |
|7480284|2010-01-02 09:11:00|1674 |2873 |$27.78 |Swipe Transaction |60569 |Jonesboro |AR |72401.0|5300|NULL |2873 |1674 |Amex |Credit |366520954874839 |05/2022|447|YES |2 |$8800 |05/2005 |2011 |No |1674 |70 |64 |1949 |4 |Male |5073 Wessex Avenue |35.49 |-90.35 |$14172 |$26858 |$11245 |712 |2 |5300|Wholesale Clubs |7480284 |No |
|7480339|2010-01-02 09:27:00|1070 |4138 |$35.20 |Swipe Transaction |99256 |Marion |IA |52302.0|5411|NULL |4138 |1070 |Mastercard|Debit |5588241759620390|08/2022|902|YES |1 |$28666 |08/2004 |2010 |No |1070 |61 |65 |1958 |11 |Male |841 Wessex Boulevard |42.03 |-91.58 |$25275 |$51528 |$58509 |745 |6 |5411|Grocery Stores, Supermarkets |NULL |NULL |
|7480412|2010-01-02 09:46:00|509 |4588 |$5.54 |Swipe Transaction |60569 |Charmco |WV |25958.0|5300|NULL |4588 |509 |Visa |Debit |4262181069766792|07/2022|519|YES |1 |$12721 |09/2005 |2015 |No |509 |33 |66 |1986 |7 |Male |239 Sussex Drive |38.41 |-82.43 |$21842 |$44534 |$107410 |702 |4 |5300|Wholesale Clubs |7480412 |No |
|7480788|2010-01-02 11:19:00|1772 |5918 |$8.77 |Swipe Transaction |12029 |Rancho Cucamonga|CA |91730.0|5411|NULL |5918 |1772 |Visa |Credit |4148318094287271|12/2020|313|YES |2 |$12800 |12/2007 |2009 |No |1772 |47 |70 |1972 |10 |Male |9070 West Street |34.09 |-117.58 |$24603 |$50163 |$97188 |761 |4 |5411|Grocery Stores, Supermarkets |NULL |NULL |
|7482356|2010-01-02 18:04:00|1936 |5914 |$7.45 |Swipe Transaction |21739 |Richmond |VT |5477.0 |5300|NULL |5914 |1936 |Visa |Debit |4653215018449189|09/2014|624|YES |1 |$27006 |12/2007 |2007 |No |1936 |86 |68 |1933 |7 |Female|406 El Camino Boulevard |44.4 |-73.0 |$26951 |$35685 |$1135 |714 |5 |5300|Wholesale Clubs |7482356 |No |
|7482441|2010-01-02 18:36:00|1196 |4542 |$100.00 |Swipe Transaction |27092 |Lake Havasu City|AZ |86405.0|4829|NULL |4542 |1196 |Visa |Credit |4898915603819946|06/2020|656|YES |1 |$11100 |09/2003 |2011 |No |1196 |60 |61 |1959 |6 |Female|498 Elm Lane |35.24 |-113.76 |$17113 |$34895 |$56622 |820 |4 |4829|Money Transfer |NULL |NULL |
|7482539|2010-01-02 19:16:00|1096 |3012 |$14.76 |Swipe Transaction |68765 |Syracuse |NY |13212.0|5411|NULL |3012 |1096 |Mastercard|Credit |5257815762743073|10/2020|52 |YES |2 |$5500 |05/2009 |2009 |No |1096 |59 |67 |1960 |10 |Male |1089 Norfolk Avenue |43.04 |-76.14 |$20294 |$41379 |$95988 |681 |2 |5411|Grocery Stores, Supermarkets |NULL |NULL |
|7482814|2010-01-02 21:36:00|1896 |4974 |$16.08 |Swipe Transaction |60569 |Plymouth |MI |48170.0|5300|NULL |4974 |1896 |Visa |Credit |4468355695964457|12/2023|955|YES |1 |$14000 |10/2002 |2008 |No |1896 |50 |79 |1969 |9 |Female|6695 River Lane |41.91 |-83.38 |$19736 |$40246 |$74352 |641 |5 |5300|Wholesale Clubs |7482814 |No |
|7482900|2010-01-02 22:24:00|618 |3411 |$16.99 |Swipe Transaction |92741 |Lynwood |CA |90262.0|5813|NULL |3411 |618 |Visa |Debit |4377250779161296|02/2024|393|YES |1 |$12496 |06/2008 |2010 |No |618 |32 |67 |1987 |8 |Male |3525 Second Lane |33.92 |-118.2 |$13632 |$27795 |$72318 |701 |4 |5813|Drinking Places (Alcoholic Beverages) |NULL |NULL |
|7484660|2010-01-03 11:14:00|1857 |5089 |$39.77 |Swipe Transaction |91128 |Morris Plains |NJ |7950.0 |5411|NULL |5089 |1857 |Mastercard|Credit |5571571366314376|07/2024|126|YES |1 |$27700 |10/2007 |2013 |No |1857 |32 |66 |1987 |8 |Male |4063 Burns Boulevard |40.77 |-74.39 |$47698 |$97248 |$197100 |775 |5 |5411|Grocery Stores, Supermarkets |7484660 |No |
|7484831|2010-01-03 11:52:00|1654 |2915 |$67.13 |Swipe Transaction |43293 |Naugatuck |CT |6770.0 |5499|NULL |2915 |1654 |Visa |Credit |4757921989009514|10/2023|921|YES |1 |$14900 |05/2006 |2014 |No |1654 |43 |66 |1976 |4 |Male |385 Pine Drive |41.47 |-71.3 |$24194 |$49328 |$6197 |693 |5 |5499|Miscellaneous Food Stores |NULL |NULL |
|7486009|2010-01-03 16:19:00|1079 |5826 |$188.80 |Swipe Transaction |5373 |Rockville Centre|NY |11570.0|4900|NULL |5826 |1079 |Amex |Credit |362822137135948 |10/2022|44 |YES |1 |$13400 |12/2005 |2010 |No |1079 |65 |60 |1954 |11 |Female|422 Madison Lane |40.66 |-73.63 |$48994 |$103294 |$39076 |831 |3 |4900|Utilities - Electric, Gas, Water, Sanitary|7486009 |No |
|7486306|2010-01-03 17:49:00|171 |4490 |$41.16 |Swipe Transaction |67570 |Greenville |SC |29607.0|5311|NULL |4490 |171 |Mastercard|Debit (Prepaid)|5648056200144957|12/2024|793|YES |2 |$46 |09/1996 |2007 |No |171 |43 |70 |1976 |5 |Female|5194 Grant Street |34.83 |-82.37 |$24314 |$49577 |$142314 |694 |3 |5311|Department Stores |NULL |NULL |
+-------+-------------------+---------+-------+--------+------------------+-----------+----------------+--------------+-------+----+------+-------+--------------+----------+---------------+----------------+-------+---+--------+----------------+------------+--------------+---------------------+----------------+---------+-----------+--------------+----------+-----------+------+-----------------------------+--------+---------+-----------------+-------------+----------+------------+----------------+----+------------------------------------------+--------------+--------+
only showing top 20 rows

In [126… # Create the temporary view for the sampled ABT
abt_02.createOrReplaceTempView('tb_abt_02')

In [127… query_09 = spark.sql('''
    SELECT
        is_fraud,
        COUNT(*) AS QTDE_TRANSACOES,
        -- Same target split as before, now computed on the 50% sample
        ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM tb_abt_02), 3) AS PERCENTUAL
    FROM
        tb_abt_02
    GROUP BY
        is_fraud
    ORDER BY
        QTDE_TRANSACOES DESC;
''')
query_09.show(20, truncate=False)

+--------+---------------+----------+
|is_fraud|QTDE_TRANSACOES|PERCENTUAL|
+--------+---------------+----------+
|No |4450946 |66.913 |
|NULL |2194300 |32.988 |
|Yes |6640 |0.1 |
+--------+---------------+----------+
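One simple way to check that the sample is representative of the target is to compare, class by class, how many rows were kept. With a 50% random sample each ratio should be close to 0.5. The sketch below uses the counts printed for `tb_abt_01` and `tb_abt_02`, and its variable names are illustrative.

```python
# Counts per target value, copied from the two outputs; None stands for NULL.
full = {'No': 8901631, None: 4390952, 'Yes': 13332}
sample = {'No': 4450946, None: 2194300, 'Yes': 6640}

# Fraction of each class that ended up in the sample.
kept = {label: round(sample[label] / full[label], 4) for label in full}

for label, ratio in kept.items():
    print(label, ratio)

print('Sample rows:', sum(sample.values()))
```

All three classes are retained at very nearly the same ratio, so the class proportions of the sample stay close to those of the full ABT.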

6. Saving the ABT in Parquet format


In [128… dataframe_volumetry(abt_02, 'tb_abt_02')

Volumetria do tb_abt_02
Linhas: 6651886
Colunas: 43

In [129… # Directory where the data will be saved
caminho = 'dados/ABT/'

# Write only if the ABT directory already exists
if os.path.exists(caminho):
    # Export to Parquet with snappy compression, replacing any previous files
    abt_02.write.option('compression', 'snappy').parquet(caminho, mode='overwrite')
    # Validate the number of rows read back from the Parquet files
    read_abt_02 = spark.read.parquet(caminho)
    print(f'\nThe ABT parquet has {read_abt_02.count()} rows.')
else:
    print(f'\nAn error occurred: the directory "{caminho}" does not exist!')

The ABT parquet has 6651886 rows.
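The cell above writes only when the target directory already exists and reports an error otherwise. An alternative, sketched here under the assumption that creating the directory is acceptable, is to make sure it exists before writing. `ensure_dir` is a hypothetical helper, and the temporary base directory is used only so the sketch has no side effects on the project tree.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Illustration in a temporary location; in the notebook the path would be 'dados/ABT/'.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, 'dados', 'ABT'))

print(os.path.isdir(target))
```

Calling the helper a second time on the same path is harmless, so it can safely precede every export.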
