
OPEN-SOURCE EXTENDABLE

ON-CHAIN ANALYSIS FRAMEWORK

&

ANOMALY ANALYSIS FOR BLOCKCHAINS

Final Year Project - Final Report

By

Group GodSight

190094B - Abegunawardhana U. K. K. P.
190093T - Bodaragama D. B.
190478E - Pulle D. M. P.

Under the supervision of


Dr. Sapumal Ahangama
Prof. Indika Perera

Department of Computer Science & Engineering


University of Moratuwa
Sri Lanka
May 2024
ABSTRACT

Blockchain technology requires sophisticated tools to understand transaction behavior and detect anomalies, ensuring network security and integrity. In response, this research introduces the GodSight framework, an open-source, extendable on-chain analysis platform that supports data extraction and customizable user-defined metrics. The framework currently includes support for the Avalanche and Bitcoin networks and is designed for expandability, allowing users to integrate additional blockchains for on-chain analysis. It also provides options for users to define custom metrics for comprehensive blockchain data analysis, delivering visualizations via a responsive dashboard application. These features make GodSight a versatile solution for detailed blockchain analysis.

To complement the framework and address the growing need for anomaly detection, a tabular-based Bitcoin anomaly dataset for illicit addresses was developed using the BABD-13 dataset. This dataset simplifies Bitcoin address behavior analysis using statistical feature extraction, providing a practical alternative to graph-based representations. Several machine learning models were evaluated, including tree-based and tabular neural network models. A modified TabNet model, incorporating a 1D CNN base extraction component and PReLU activation, achieved superior accuracy compared to standard TabNet and Random Forest models. The feature extraction component allowed the modified TabNet model to better distinguish the boundary between illicit and non-illicit Bitcoin addresses, resulting in improved anomaly detection.

This research effectively demonstrates the value of the GodSight framework for comprehensive blockchain analysis and highlights the modified TabNet model's ability to accurately identify illicit patterns in blockchain addresses.

CONTENTS

1 INTRODUCTION 1

2 PROBLEM STATEMENT 3
2.1 Phase 1: On-Chain Analytics Platform . . . . . . . . . . . . . . . . . . . 3
2.2 Phase 2: Anomaly Detection System . . . . . . . . . . . . . . . . . . . . 3

3 MOTIVATION 4

4 LITERATURE REVIEW 5
4.1 Data Extraction from Blockchain Networks . . . . . . . . . . . . . . . . . 5
4.2 On-Chain Analysis Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.3 On-chain Analysis Platforms/Applications . . . . . . . . . . . . . . . . . 7
4.4 Blockchain Anomaly Analysis . . . . . . . . . . . . . . . . . . . . . . . . 7
4.4.1 ANOMALY DETECTION WITH UNSUPERVISED LEARNING 8
4.4.2 GRAPH-BASED APPROACHES FOR ANOMALY DETECTION 8
4.4.3 STATISTICAL APPROACHES . . . . . . . . . . . . . . . . . . . 9
4.4.4 TREE-BASED MODELS . . . . . . . . . . . . . . . . . . . . . . 9
4.4.5 NEURAL NETWORKS IN TABULAR MODELS . . . . . . . . . 9

5 RESEARCH OBJECTIVES 11
5.1 Development of an Extendable Open-Source On-Chain Analysis Frame-
work for Blockchain Networks . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 Creation of an Accurate Anomaly Detection Model for Blockchain Trans-
actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

6 METHODOLOGY 12
6.1 Open-Source SaaS Platform . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.1.1 Metrics Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.1.1.1 Metric Categorization . . . . . . . . . . . . . . . . . . . 12
6.1.1.2 Metrics Formulation . . . . . . . . . . . . . . . . . . . . 14
6.1.2 System Design & Architecture . . . . . . . . . . . . . . . . . . . . 21
6.1.2.1 Framework Architecture . . . . . . . . . . . . . . . . . . 21
6.1.2.2 Extendibility . . . . . . . . . . . . . . . . . . . . . . . . 23
6.1.2.3 Customizability . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 Bitcoin Anomaly Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6.2.1 BABD-13: DATASET OVERVIEW . . . . . . . . . . . . . . . . . 25
6.2.2 TABULAR-BASED BITCOIN ANOMALY DATASET CREATION 27
6.2.2.1 Tabular-Based Dataset Approach . . . . . . . . . . . . . 27
6.2.2.2 Statistical Feature Extraction . . . . . . . . . . . . . . . 27
6.2.2.3 Importance of Transaction Time and Value . . . . . . . 27
6.2.2.4 Differentiation Between Input and Output Transaction
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2.2.5 Cross-Featuring Between Input and Output Transactions 28
6.2.2.6 Overview of Newly Created Features . . . . . . . . . . . 28
6.2.3 MACHINE LEARNING WORKFLOW . . . . . . . . . . . . . . . 28
6.2.3.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . 28
6.2.3.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . 28
6.2.3.3 Handling Data Imbalance . . . . . . . . . . . . . . . . . 29
6.2.3.4 Data Standardization . . . . . . . . . . . . . . . . . . . 29
6.2.3.5 Principal Component Analysis (PCA) . . . . . . . . . . 29
6.2.4 MODEL TRAINING AND ARCHITECTURE . . . . . . . . . . . 29
6.2.4.1 Tree-Based Models and Tabular NN Models . . . . . . . 30
6.2.4.2 Modified TabNet Model Architecture . . . . . . . . . . . 30
6.2.4.3 Introduction of 1D CNN Feature Extraction . . . . . . . 31
6.2.5 EVALUATION METRICS . . . . . . . . . . . . . . . . . . . . . . 31
6.2.5.1 Imbalanced Test Set Description . . . . . . . . . . . . . 31
6.2.5.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 RESULTS AND DISCUSSION 33


7.1 Extendable Open-Source On-chain Framework . . . . . . . . . . . . . . . 33
7.1.1 Framework Performance and Effectiveness . . . . . . . . . . . . . 33
7.1.1.1 Speed of Data Processing and Accuracy of Data Extraction 33
7.1.1.2 Accuracy in Computation . . . . . . . . . . . . . . . . . 33
7.1.1.3 Responsiveness of the Dashboard . . . . . . . . . . . . . 34
7.1.2 Extensibility and Customization . . . . . . . . . . . . . . . . . . . 34
7.1.2.1 Framework Support Through Utils Component . . . . . 34
7.1.2.2 Case Studies of Successful Integration . . . . . . . . . . 35
7.1.3 User Experience and Interface Usability . . . . . . . . . . . . . . . 35
7.1.3.1 Usability and Presentation . . . . . . . . . . . . . . . . . 36
7.1.3.2 Target Users . . . . . . . . . . . . . . . . . . . . . . . . 36
7.1.3.3 Ease of Creating Simple Metrics through the API . . . . 37
7.1.4 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . 38
7.1.5 Future Directions and Improvements . . . . . . . . . . . . . . . . 39
7.1.5.1 Technical Improvements . . . . . . . . . . . . . . . . . . 39
7.1.5.2 New Features and Functionalities . . . . . . . . . . . . . 39
7.1.5.3 Broader Blockchain Support . . . . . . . . . . . . . . . . 40
7.2 Illicit Bitcoin Address Detection . . . . . . . . . . . . . . . . . . . . . . . 40

7.2.1 Created Dataset Details . . . . . . . . . . . . . . . . . . . . . . . 40
7.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.2.3 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.2.4 Model Training and Results Analysis . . . . . . . . . . . . . . . . 44
7.2.4.1 Results Overview . . . . . . . . . . . . . . . . . . . . . . 44
7.2.4.2 Analysis and Interpretation . . . . . . . . . . . . . . . . 44

8 CONCLUSION 46

LIST OF FIGURES

6.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


6.2 Modified TabNet Architecture . . . . . . . . . . . . . . . . . . . . . . . . 30

7.1 Metric Chart with filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


7.2 Creating Simple Metrics from the Dashboard . . . . . . . . . . . . . . . . 37
7.3 Null data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

LIST OF TABLES

6.1 Table of the general metrics . . . . . . . . . . . . . . . . . . . . . . . . . 15


6.2 Table of the Avalanche-Specific metrics . . . . . . . . . . . . . . . . . . . 18

7.1 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Chapter 1
INTRODUCTION

In the digital age, decentralized networks have emerged as groundbreaking innovations, revolutionizing data storage, transfer, and validation [1]. At the heart of this transformation lies blockchain technology, a decentralized ledger system celebrated for its transparency, security, and data immutability. While pioneering networks like Bitcoin, which introduced a decentralized peer-to-peer electronic cash system [2], and Avalanche, known for its scalable architecture [3], have set benchmarks, they also underscore the pressing need for advanced tools to delve into on-chain data, ensuring transactional integrity and detecting anomalies.

Blockchain analysis can be broadly categorized into two types: descriptive and predictive. Descriptive analysis focuses on providing a detailed overview of transactions and activities within a blockchain, while predictive analysis aims to forecast potential trends and anomalies based on existing data. Our project ambitiously seeks to cover both these areas, offering a holistic approach to blockchain analysis.

On-chain analysis, a vital component of descriptive analysis in the blockchain ecosystem, offers insights into transactional patterns and participant behaviors, ensuring the security and integrity of networks [4]. A robust on-chain analysis framework is essential for interpreting and visualizing this vast data, identifying irregularities, and facilitating informed decisions. Tools such as IntoTheBlock, Glassnode Studio, Nansen, and Etherscan have already made significant strides in this domain, offering in-depth analytics for various blockchain networks [1] [5]. As blockchain finds applications in diverse sectors like healthcare and supply chains, the importance of robust on-chain analysis frameworks becomes even more pronounced [1] [6].

The convergence of open-source applications, Software as a Service (SaaS) platforms, and extendable frameworks in modern software development offers transformative potential [7] [8] [9]. Such frameworks, designed for extensibility, allow for the seamless integration of diverse functionalities, ensuring adaptability to evolving needs [10]. The customization capabilities within these SaaS solutions empower businesses to tailor applications to their specific requirements, enhancing user experiences and operational efficiency [11]. However, this customization also presents challenges, including increased maintenance complexity [11].

Our project aims to bridge this gap by creating an open-source SaaS extendable framework for on-chain descriptive analysis. This framework, distinct in its transparency and adaptability, ensures easy extensibility, fostering a community-driven enhancement process. Our platform's open-source nature promotes trustworthiness, offering users a transparent view of its inner workings, thereby fostering a collaborative ecosystem for on-chain analysis.

For the predictive analysis aspect, we delve into anomaly detection within blockchain data, particularly emphasizing Bitcoin due to its global recognition and established nature. Traditional methods have employed unsupervised learning techniques for detecting irregularities in Bitcoin transactions [12]. However, our approach is anchored on the BABD-13 dataset, which provides 13 types of labels for Bitcoin addresses [13]. Our strategy aims to harness statistical methodologies, focusing on temporal patterns and transactional values for multi-class classification.

In essence, our project is a testament to our commitment to advancing the blockchain ecosystem comprehensively. As networks like Avalanche and Bitcoin continue to grow, our tools aim to evolve alongside, offering users a full spectrum of insights into on-chain data and anomalies.

Chapter 2
PROBLEM STATEMENT

2.1 PHASE 1: ON-CHAIN ANALYTICS PLATFORM

Blockchain is a public ledger—a vast repository of transactional data generated each moment. Avalanche and Bitcoin, being among the most active networks, contribute significantly to this data deluge. While the data is freely accessible, extracting meaningful insights from it remains a formidable challenge. Current platforms either lack depth, fail to provide a comprehensive view, or are too specialized, catering to a niche audience. For the layman, trader, or enterprise, there is a palpable absence of a unified, user-friendly platform offering exhaustive on-chain analytics.

The complexities don't just stem from the volume of data but also its diversity. Transactions, smart contract interactions, token transfers, consensus events—each data type requires specialized handling, and existing tools often fall short in providing a seamless integration of these diverse datasets. The challenge amplifies when considering scalability and adaptability. How can we ensure that a platform remains relevant amidst the rapid evolution of blockchain protocols and standards? How can it adapt to future networks or integrate advancements in data analytics?

2.2 PHASE 2: ANOMALY DETECTION SYSTEM

As Avalanche and Bitcoin networks swell in transactional volume, they inevitably become targets for malicious actors. Traditional financial systems have established anomaly detection mechanisms. However, the decentralized, global nature of blockchains brings unique challenges. While analytics can offer insights, they don't inherently guarantee the authenticity of transactions.

Identifying irregularities or anomalous patterns amid the sea of legitimate transactions is paramount. However, without specialized tools, this task is akin to navigating a maze blindfolded. Current mechanisms either generate too many false positives or miss subtle, sophisticated anomalies—both scenarios being undesirable. The need is for an intelligent, learning system that evolves with the network, ensuring threats are identified and addressed promptly.

Chapter 3
MOTIVATION

The blockchain landscape, led by pioneers like Avalanche and Bitcoin, is akin to a vast ocean, teeming with data that holds the promise of transformative insights. As businesses, researchers, and enthusiasts sail these waters, they often seek navigational aids - tools that can help them decipher this data, transforming it into actionable intelligence.

At the forefront of this quest is the need for a robust on-chain analysis platform. However, the dynamic nature of the blockchain world demands more than just a static tool. It calls for a framework - an adaptable foundation upon which diverse analytical tools can be built, modified, and enhanced. Such a platform's power isn't just in its current capabilities but in its potential to evolve, grow, and adapt. This is where extendability emerges as a pivotal feature.

Consider the burgeoning realm of carbon trading, where blockchain can play a pivotal role in tracking, verifying, and trading carbon credits. As this domain expands, new metrics, insights, and data sources will emerge. An extendable on-chain analysis platform can seamlessly integrate these new elements, ensuring it remains relevant and valuable.

But the power of extendability is truly unlocked when coupled with the principle of open-source. By being open-source, the platform invites collaboration, innovation, and enhancement from the global community. It's not just a tool but a living entity, continuously refining itself through collective wisdom.

Beyond the Avalanche ecosystem, the vast Bitcoin network holds its own set of challenges. With millions of addresses conducting transactions, there's a pressing need to understand the nature of these addresses. This is not just about transactional data but about classifying addresses based on their behavior patterns. The aim? To develop an advanced Bitcoin address classification model. Such a model can offer insights into address categories, their transactional behaviors, and more, adding another layer of depth to on-chain analysis.

In essence, our motivation is dual-pronged: to craft an extendable, open-source on-chain analysis framework tailored for Avalanche and to delve deep into the Bitcoin network, classifying its myriad addresses. Both endeavors, though distinct, resonate with a shared vision - harnessing the power of blockchain data, making it accessible, understandable, and above all, insightful.

Chapter 4
LITERATURE REVIEW

4.1 DATA EXTRACTION FROM BLOCKCHAIN NETWORKS

Data extraction from blockchain networks, especially from leading cryptocurrencies such as Bitcoin, Ethereum, and Avalanche, has become a pivotal research area. The transparent and immutable nature of blockchain ledgers offers a plethora of data. When effectively extracted, this data can provide invaluable insights into transaction patterns, anomalies, and potential illicit activities.

In the study titled "DataEther: Data Exploration Framework For Ethereum," the researchers introduced a systematic and high-fidelity data exploration framework for Ethereum. They exploited its internal mechanisms and instrumented an Ethereum full node [14]. The proposed method, DataEther, acquires all blocks, transactions, execution traces, and smart contracts from Ethereum. This approach overcomes the limitations of existing methods, such as incomplete data, confusing information, and inefficiency. According to the authors, several recent studies have made intriguing observations by examining Ethereum data. However, these studies have certain limitations in their methodologies for data acquisition. These methods can be broadly categorized into four types:

1. Downloading and parsing block files,

2. Utilizing Web3 APIs provided by Ethereum (illustrated just below),

3. Crawling blockchain explorer websites,

4. Instrumenting Ethereum node.
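
As a minimal illustration of the second method above, the Python snippet below uses the web3.py library to pull one block and its transactions from an Ethereum node. The node URL and block number are placeholders, and this sketch is ours rather than a method drawn from DataEther itself.

from web3 import Web3

# Placeholder JSON-RPC endpoint of any synced Ethereum node.
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

# Fetch one block together with its full transaction objects.
block = w3.eth.get_block(17_000_000, full_transactions=True)

for tx in block.transactions:
    # Each transaction exposes its sender, receiver, and value in wei.
    print(tx["from"], tx["to"], tx["value"])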

Each of these methods has its own set of constraints and may not provide a comprehensive view of the Ethereum ecosystem.

Another research effort, titled "Understanding Ethereum via Graph Analysis," focused on collecting all accounts and transactions from the launch of Ethereum until November 1, 2018 [15]. The authors emphasized the complexities involved in extracting multifaceted transaction data from blockchain networks. The researchers utilized a dual-pronged data extraction strategy, synchronizing all historical transaction data using the Ethereum client and concurrently tapping into Etherscan APIs.

In "Predicting Bitcoin Returns Using High-Dimensional Technical Indicators," the authors used a BTC-USD dataset from investing.com, which included daily open, high, low, and close prices of Bitcoin from January 1st, 2012 to December 29th, 2017 [16]. They divided the dataset into training and test samples, emphasizing the importance of robust data extraction methodologies.

The paper "An On-Chain Analysis-Based Approach to Predict Ethereum Prices" gathered data from the public Ethereum blockchain and online resources' APIs from 2016 through 2021 [17]. The authors analyzed metrics against the Ethereum price using on-chain data, aiming to provide a broader overview by incorporating on-chain metrics relating to miners, users, and exchange activity and their possible impact on Ethereum pricing.

The study "Anomaly Detection Model Over Blockchain Electronic Transactions" revolved around the extraction of Bitcoin transaction data from https://www.blockchain.com/charts [18]. The research "Transaction-based classification and detection approach for Ethereum smart contract" showcased a dual-pronged data extraction strategy [19]: it synchronized all historical transaction data using the Ethereum client and concurrently tapped into Etherscan APIs. This dual method ensured the capture of a vast spectrum of data, revealing the multifaceted nature of Ethereum's smart contract transactions.

In conclusion, data extraction from blockchain networks, especially from Bitcoin, Ethereum, and Avalanche, has seen varied methodologies. From using public APIs to setting up dedicated nodes, researchers have adopted multiple strategies to ensure comprehensive and accurate data retrieval.

4.2 ON-CHAIN ANALYSIS METRICS

On-chain analysis has become a cornerstone in cryptocurrency research, offering a holistic view of transactions and patterns within blockchain networks. This method of analysis delves into the data stored within a blockchain, providing insights that can be instrumental in predicting the future movements of cryptocurrency prices.

In "An On-Chain Analysis-Based Approach to Predict Ethereum Prices," the exploration of various on-chain metrics to predict Ethereum prices was undertaken [20]. Metrics such as total gas used, miner revenue, hash rate, transaction count, active addresses, and block size were leveraged. By juxtaposing these metrics against normalized Ethereum prices, correlations were discerned. Further, the relationship between these metrics and Ethereum prices was scrutinized using Pearson and Spearman correlation coefficients.

Another research effort, presented at the International Conference on Blockchain and Cryptocurrencies, employed a set of six on-chain time series metrics sourced from Glassnode [21]. The metrics encompassed new addresses, active addresses, block height, fees, hash rate, and the Spent Output Profit Ratio (SOPR). Stochastic processes and deep learning models were juxtaposed to forecast the values of these metrics, emphasizing the efficacy of this approach for statistical hedging and long-term asset allocation in the cryptocurrency domain.

In "Blockchain-based Cryptocurrency Price Prediction with Chaos Theory, On-chain Analysis, Sentiment Analysis and Fundamental-Technical Analysis," methods used for cryptocurrency price prediction were categorized [22]. Under the umbrella of on-chain analysis, the essence was delineated as the examination of data within a blockchain network. Such an analysis offers a panoramic view of all transactions, given their permanent imprint on the blockchain. The data for on-chain analysis is broadly segmented into transaction data, block data, and smart contract code. The historical significance of on-chain analysis, tracing its roots back to 2011, was also highlighted. Several key indicators, such as the Network Value-to-Transactions (NVT) ratio and UTXOs (unspent transaction outputs), were introduced, which have been instrumental in gauging the value and movement of cryptocurrencies.

In conclusion, on-chain analysis metrics have proven to be invaluable tools in the cryptocurrency research landscape. By offering a comprehensive view of transactions and patterns within blockchain networks, these metrics provide researchers and investors with insights that can guide decision-making and predict future market movements.

4.3 ON-CHAIN ANALYSIS PLATFORMS/APPLICATIONS

Nansen: Nansen is a versatile platform that supports multiple blockchains, including Ethereum, Polygon, BNB Chain, Avalanche, Fantom, Ronin, Celo, and Terra 2.0. One of its standout features is the provision of real-time on-chain data. Additionally, it offers wallet labeling, which provides a deeper insight into on-chain activity. Nansen also boasts multichain dashboards and a unique feature named "DeFi Paradise". Notably, it provides support for Avalanche [23].

Dune: Dune Analytics is primarily focused on Ethereum, Polygon, BNB Chain, Solana, Arbitrum, and Avalanche. It differentiates itself with a community-driven approach to data. Users can query the blockchain using SQL, browse trending projects, and even explore dashboards created by their favorite creators. Dune also offers support for Avalanche [24].

Glassnode: Glassnode provides a comprehensive analysis of Bitcoin, Ethereum, Litecoin, and various stablecoins and tokens. The platform offers market data, on-chain data, network data, and alerts. Users also have access to an API, customizable dashboards, insights, and reports. However, it does not currently support Avalanche [25].

IntoTheBlock: IntoTheBlock supports a wide range of blockchains, including Bitcoin, Ethereum, Bitcoin Cash, Litecoin, EOS, and more. Similar to Glassnode, it offers market data, on-chain data, network data, alerts, API access, customizable dashboards, insights, and reports. A notable feature is its support for Avalanche [26].

4.4 BLOCKCHAIN ANOMALY ANALYSIS

Detecting anomalies is vital in uncovering suspicious or fraudulent behavior in data that diverges from expected patterns. This process is especially critical in blockchain transactions due to the semi-anonymous nature of cryptocurrency addresses. In this review, we explore different methods used in anomaly detection, such as unsupervised learning, graph-based models, statistical techniques, and advanced neural networks like 1D CNNs and TabNet. We discuss past research that forms the foundation of our approach, with a particular focus on its relevance in the blockchain context.

4.4.1 ANOMALY DETECTION WITH UNSUPERVISED LEARNING

Unsupervised learning algorithms excel at discovering hidden patterns in transactional data without needing pre-labeled examples. In [27], Aung and colleagues used autoencoders to detect anomalies in Bitcoin transactions, learning typical behavior patterns to identify deviations that signal potential fraud. Le et al. [28] took a different approach, using clustering methods to spot malicious activity in Ethereum smart contracts. Akcora et al. [29] added another layer by employing graph-based heuristics to pinpoint suspicious behavior across blockchain networks. These techniques are especially valuable for detecting illicit activity patterns in cases where labeled data is unavailable.

4.4.2 GRAPH-BASED APPROACHES FOR ANOMALY DETECTION

Graph-based models are particularly adept at analyzing the complex relationships between entities in blockchain networks. Xiang et al. [13] introduced the Bitcoin Address Behavior Dataset (BABD-13), which classifies Bitcoin addresses as either illicit or non-illicit based on transaction patterns. They created a graph representation of Bitcoin transactions, extracting features like node degree and clustering coefficient to distinguish different types of behavior. This dataset includes 13 categories, each representing a particular type of crime, offering a comprehensive resource to understand the behavioral traits of different addresses.

Akcora et al. [29] developed heuristic graph-based models to pinpoint high-risk Bitcoin network transactions by combining structural and temporal features. They utilized temporal patterns to track how ransomware-related activities occur on the blockchain. Their innovative method, aptly called "Bitcoin Heist," distinguished ransomware transactions from ordinary ones, providing a unique framework for detecting anomalies.

Weber et al. [30] explored graph embeddings to classify Bitcoin transactions, testing graph convolutional networks to capture the structural properties of transaction graphs. Their method used embedding techniques to develop features representing the graph structure, providing a more sophisticated analysis of transaction networks. Their research significantly improved financial forensics by enhancing anti-money laundering efforts through the identification of suspicious Bitcoin activities.

4.4.3 STATISTICAL APPROACHES

Statistical analysis plays a vital role in detecting irregularities in blockchain transaction data. Maheshwari et al. [31] used statistical tests to examine Bitcoin transactions, analyzing time intervals between consecutive transactions to uncover patterns indicative of illicit activity. They used the BABD-13 dataset to provide the labeled data needed for the model. Venter et al. [32] applied statistical machine-learning models to Bitcoin transaction logs, helping identify patterns that signal potential anomalies. Similarly, Zhuang et al. [33] used statistical features to predict and classify fraudulent cryptocurrency accounts.

4.4.4 TREE-BASED MODELS

Tree-based models like Random Forest and XGBoost are highly effective classification
tools for anomaly detection. Bahnsen et al. [34] utilized Random Forest models to detect
fraud in Bitcoin transactions. Gai et al. [35] employed XGBoost models, using historical
transaction data to identify anomalies in blockchain networks. Conti et al. [36] showcased
the efficacy of ensemble models in identifying fraudulent behavior, demonstrating how
these methods can enhance anomaly detection.

4.4.5 NEURAL NETWORKS IN TABULAR MODELS

Neural networks like Convolutional Neural Networks (CNNs) are highly effective in detecting anomalies in tabular data, particularly in blockchain applications. Baosenguo's 1D CNN model [37] performed impressively in a Kaggle competition. The model used a fully connected layer to induce locality patterns in the tabular data, followed by several 1D convolutional layers with shortcut-like connections. Despite tabular datasets typically lacking locality features, the model excelled in classification by efficiently using convolutional filters.
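
A minimal PyTorch sketch of this dense-projection-followed-by-1D-convolution idea is given below; the layer sizes and the placement of the shortcut connection are our own illustrative choices, not Baosenguo's exact architecture.

import torch
import torch.nn as nn

class Tabular1DCNN(nn.Module):
    # A dense layer first projects the flat feature vector into a larger
    # space that is reshaped into pseudo-channels, giving the 1D
    # convolutions locality to exploit.
    def __init__(self, n_features, n_classes, channels=16, length=32):
        super().__init__()
        self.project = nn.Linear(n_features, channels * length)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.head = nn.Linear(channels * length, n_classes)
        self.channels, self.length = channels, length

    def forward(self, x):
        z = self.act(self.project(x)).view(-1, self.channels, self.length)
        z = self.act(self.conv1(z))
        z = z + self.act(self.conv2(z))  # shortcut-like connection
        return self.head(z.flatten(1))

# Example: 64 statistical features in, binary illicit/non-illicit logits out.
model = Tabular1DCNN(n_features=64, n_classes=2)
logits = model(torch.randn(8, 64))  # batch of 8 rows -> shape (8, 2)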
Arik and Pfister [38] designed TabNet, a groundbreaking attention-based neural network model tailored specifically for tabular data. TabNet employs feature selection mechanisms via sequential attention steps, enabling the model to identify and focus on the most relevant features while maintaining interpretability. This feature selection improves predictive accuracy and provides transparency into which features matter most.
Joseph and Raj [39] introduced GATE (Gated Additive Tree Ensemble), which blends
decision trees and neural networks for superior classification of tabular data. GATE
combines tree ensemble learning and neural networks within a gated framework, boosting
classification accuracy by using a gating mechanism to filter out less informative features.
This hybrid approach enhances performance by learning intricate patterns often missed
by traditional tree models.

Popov et al. [40] proposed the Neural Oblivious Decision Ensemble (NODE), a deep
learning model crafted for tabular data. NODE integrates decision trees and neural
networks to capture hierarchical feature interactions while retaining the interpretability
of decision trees. The model employs a series of differentiable oblivious decision trees
that efficiently detect complex patterns in tabular datasets, resulting in high anomaly
detection performance.

Chapter 5
RESEARCH OBJECTIVES

5.1 DEVELOPMENT OF AN EXTENDABLE OPEN-SOURCE ON-CHAIN
ANALYSIS FRAMEWORK FOR BLOCKCHAIN NETWORKS

1. Develop a comprehensive analysis framework for blockchain networks and initiate it with the Avalanche blockchain network, aiming to provide in-depth insights into its on-chain activities and trends.

2. Design a user-centric platform that offers a suite of tools and features to facilitate detailed on-chain analysis of blockchains, enhancing users' ability to derive meaningful interpretations and make informed decisions.

3. Adapt and extend the established analytical framework to seamlessly cater to another blockchain network (the Bitcoin network), ensuring that it captures, processes, and visualizes Bitcoin-specific on-chain data with the same depth and clarity.

5.2 CREATION OF AN ACCURATE ANOMALY DETECTION MODEL
FOR BLOCKCHAIN TRANSACTIONS

1. To design and train a precise anomaly detection model for identifying irregularities
in blockchain transactions.

2. To utilize supervised learning techniques for training an anomaly detection model for tabular-based Bitcoin addresses, ensuring high accuracy and reliability.

Chapter 6
METHODOLOGY

6.1 OPEN-SOURCE SAAS PLATFORM

6.1.1 METRICS RECOGNITION

Embarking on our journey to develop an open-source, extendible on-chain analysis framework for blockchains, our team delved into the heart of blockchain's complexity with a clear objective: to unlock the myriad stories embedded within its data. Recognizing the immense potential and the challenges of interpreting blockchain activities, we identified "Metrics Recognition" as a cornerstone of our research. This pivotal area not only guides our analysis framework but also ensures that the insights we derive are both actionable and profound.

Understanding the multifaceted nature of blockchain data, we embarked on crafting a comprehensive approach that systematically categorizes and formulates metrics. Our endeavor was driven by the conviction that to truly grasp the essence of blockchain dynamics, we need to transcend traditional analysis methods. By segmenting our focus into two primary areas—Metric Categories and Metric Formulation—we aimed to lay a solid foundation for our analysis tool, ensuring it not only meets the current needs of blockchain analysis but is also poised to adapt and expand alongside the blockchain ecosystem itself.

6.1.1.1 Metric Categorization

The analytical exploration of blockchain technology through on-chain metrics unveils the operational dynamics, economic activities, security vulnerabilities, and user behaviors embedded within. This analysis is paramount for a spectrum of stakeholders, from developers to investors, empowering them with the data to make informed decisions. Recognizing the intricate and diverse nature of blockchain networks, our team embarked on categorizing on-chain metrics into distinct areas. This decision was inspired by a comprehensive review of existing research and methodologies that highlighted the importance of a structured analysis framework for nuanced blockchain exploration. Such an approach not only facilitates targeted analysis but also enhances interpretability and supports the extensibility of our on-chain analysis tool, ensuring its relevance and utility in an ever-evolving digital asset landscape.

The rationale behind this structured categorization stems from the need to distill actionable insights from the vast, complex data inherent in blockchain networks. As we delve into the specifics of each category, it is important to understand that this methodological choice is not arbitrary but rather a strategic effort to align our analysis with the multifaceted nature of blockchain data. By doing so, we aim to equip users with a comprehensive toolkit for dissecting blockchain activities, fostering a deeper understanding of the underlying trends and patterns that govern these digital ecosystems.

We organized on-chain data into distinct groups because blockchain networks are highly complex and comprise many different parts. There are many different types of data on blockchains, and each type can tell us something important, aligning with methodologies seen in foundational research such as [41] [42]. Grouping these types of data into categories makes them easier to understand and analyze, supports better study of each, and positions our on-chain analysis tool for future improvement.

Given this backdrop, the formulation of metric categories becomes a pivotal element in our analytical arsenal. It is through this lens that we can sift through the blockchain's data-rich environment, identifying and classifying metrics that are most indicative of the network's health, performance, and intricacies. The following discussion provides a detailed justification for each metric category we have established, elucidating the pivotal role these categories play in enhancing our understanding of blockchain networks.

1. Transactional Metrics: The heart of any blockchain network beats with the
rhythm of its transactions. Recognizing the fundamental role that transactions
play in reflecting the network’s economic pulse, we prioritize transactional metrics
as a key category. Drawing inspiration from studies like [43], which illuminate the
economic significance of transaction patterns, our focus on transactional metrics
aims to uncover the liquidity flows, monetary dynamics, and economic vitality of the
blockchain. These metrics serve as a critical barometer for assessing the network’s
financial health and activity levels, providing essential insights for economic analysis
and strategic decision-making.

2. Network Health and Activity: Inspired by [44], we recognize that network health and user activity are pivotal for evaluating the blockchain's sustainability and adoption. Metrics under this category address the blockchain ecosystem's resilience, growth trends, and user engagement levels. This information is crucial for developers, investors, and network participants aiming to gauge the network's long-term viability.

3. Whale Watching: The concept of monitoring large holders, or "whales," is justified by research like [45], highlighting the impact of significant stakeholders on market dynamics. These metrics provide insights into market manipulation risks and sentiment analysis, crucial for risk management and investment strategy formulation.

The categorical classification of metrics not only facilitates a comprehensive analysis of blockchain networks but also lays a foundation for the extendibility of our analysis platform. By structuring metrics into well-defined categories, we enable the platform to adapt to emerging trends and innovations within the blockchain space. This modular approach allows for the seamless integration of new metrics and categories, ensuring the platform remains relevant and valuable to users as blockchain technology evolves [46].

Moreover, the categorization supports customized analysis tailored to specific user needs and interests. Whether focusing on economic analysis, security assessment, or network growth, users can leverage relevant metric categories to derive targeted insights. This flexibility underscores our platform's capability to serve a diverse user base, from investors and developers to researchers and regulatory bodies [47].

In conclusion, the strategic categorization of on-chain metrics into distinct areas is a deliberate and methodologically sound approach designed to enhance the depth, clarity, and utility of blockchain analysis. Drawing upon existing literature and industry best practices [48], we justify each category's inclusion and articulate its significance in providing a rich, multidimensional view of blockchain ecosystems. This framework not only facilitates current analysis needs but also ensures our platform's adaptability and relevance in the face of future blockchain developments.

6.1.1.2 Metrics Formulation

Now that we have thoroughly established the metric categorization, let us walk through the metric formulation. It is important to note that we have organized our metric formulation under two categories as well. They are:

1. General Metrics: Metrics that are compatible with any blockchain network.

2. Avalanche-Specific Metrics: Metrics that are specific to Avalanche.

The choice of the general metrics has been influenced by the literature, in that the features on which those general metrics are based can be found in any blockchain network. For instance, [41] and [43] underscore the importance of on-chain metrics for predictive analysis and reinforcement learning systems. Based on those, we have identified the following general metrics. Moreover, we have been careful to place these metrics under the categories identified in the Metric Categorization section. The following table presents the general metrics, where each general metric is given with its category, definition, and calculation method.

Table 6.1: Table of the general metrics

Transactional Metrics

• Active Addresses
  Definition: Count of unique addresses with transactions in a day.
  Calculation: Tallies all distinct addresses engaging in transactions within a 24-hour period.

• Active Senders
  Definition: Number of unique addresses that have sent transactions in a day.
  Calculation: Identifies and sums all unique addresses initiating transactions during the specified day.

• Transaction Frequency
  Definition: Average transaction counts processed over various time intervals (second, hour, day).
  Calculation: For each time interval, the total number of daily transactions is either divided by the interval's unit count or summed for the entire day.

• Total Transactions
  Definition: The cumulative number of transactions recorded in a blockchain subchain.
  Calculation: Aggregates all transactions recorded within the specified subchain up to the present or another specified cutoff.

• Transaction Amounts (Total/Average)
  Definition: Sum and mean of transaction values processed in a blockchain subchain on a specific date.
  Calculation: The total transaction value for the day is summed, while the average transaction value is derived by dividing this sum by the number of transactions.

• Transactions Per Block
  Definition: The average number of transactions included in each block on a specific date within a blockchain subchain.
  Calculation: Averages the day's total transactions over the number of blocks mined on that day.

Network Health and Activity

• Total Blocks
  Definition: The total number of unique blocks mined within a blockchain subchain.
  Calculation: Obtained by counting each unique block that was mined and recorded within the subchain on a specified day.

Whale Watching

• Large Transactions
  Definition: Counts the number of transactions on a specified date within a blockchain subchain where the amount either emitted or consumed exceeds a predetermined threshold.
  Calculation: Identifies and counts transactions where the transferred value exceeds a specified benchmark, indicating substantial transfers.

• Whale Address Activity
  Definition: Tracks the number of transactions on a specific date within a blockchain subchain where the total amount either emitted or consumed exceeds a predefined whale transaction threshold.
  Calculation: Tracks transactions from addresses where the total transaction value (sent or received) exceeds a threshold indicative of significant stakeholder activity.
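
To make the calculation methods above concrete, the sketch below computes two of these daily metrics with pandas over a generalized transactions table; the column names (tx_date, sender, receiver) are hypothetical, not the framework's actual schema.

import pandas as pd

def active_addresses(txs: pd.DataFrame) -> pd.Series:
    # Count distinct addresses appearing on either side of a transaction
    # per day (the Active Addresses metric).
    senders = txs[["tx_date", "sender"]].rename(columns={"sender": "addr"})
    receivers = txs[["tx_date", "receiver"]].rename(columns={"receiver": "addr"})
    return pd.concat([senders, receivers]).groupby("tx_date")["addr"].nunique()

def active_senders(txs: pd.DataFrame) -> pd.Series:
    # Count distinct addresses that initiated transactions per day
    # (the Active Senders metric).
    return txs.groupby("tx_date")["sender"].nunique()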

Apart from the generalization, the beauty of our framework, GodSight, lies in its customization. Any individual, consortium, or organization that uses GodSight can tailor their tool to represent any of the metrics that they have identified and found useful. Building on the foundation of customization and adaptability that characterizes our framework, GodSight extends the capability of personalized metric development beyond predefined categories. This unique feature empowers individuals, consortiums, or organizations leveraging GodSight not only to tailor metrics within established categories but also to pioneer new categories that cater to evolving needs and novel insights specific to their blockchain analysis objectives. This flexibility ensures that GodSight remains at the forefront of on-chain analysis, accommodating the dynamic landscape of blockchain technologies and the diverse analysis requirements of its users.

As a testament to the framework's versatility and forward-thinking design, we introduce the concept of Economic Indicators and Cross-Chain Metrics as exemplary new categories, born out of the need to highlight the importance of financial metrics in understanding market dynamics and investment potential, and to understand and quantify the complexities of interoperability and asset flow across different blockchain networks. These categories emerge in response to the growing importance of interconnected blockchain ecosystems, highlighting GodSight's ability to adapt and innovate in alignment with the latest trends and technological advancements in the blockchain space.

1. Economic Indicators: The inclusion of economic indicators as a separate category draws upon [43] [45], which underscore the importance of financial metrics in understanding market dynamics and investment potential. These metrics provide a macroeconomic perspective of the blockchain, offering insights into market capitalization, trading volumes, and token distribution, essential for economic analysis and forecasting.

2. Cross-Chain Metrics: The cross-chain metrics serve as a crucial analytical lens through which the efficiency, security, and economic implications of cross-chain activities are examined. This category encompasses metrics designed to evaluate the interoperability between blockchains, offering insights into liquidity movement, transactional coherence, and the overall impact of these interactions on the broader digital asset market. By enabling the creation of such tailored categories and metrics, GodSight not only enhances the depth and breadth of blockchain analysis but also empowers users to navigate and exploit the intricacies of a multi-chain world.

This capability of GodSight to facilitate the introduction of new, customized metric categories like Economic Indicators and Cross-Chain Metrics underscores our commitment to innovation and user empowerment. It reflects our understanding that the blockchain ecosystem is continually evolving, and so are the analysis needs of those who seek to explore its depths. Through GodSight, we provide a platform not just for analysis but for innovation, enabling our users to define the cutting edge of on-chain analysis.

In the spirit of demonstrating the adaptability and user-centric design of GodSight, we now present a curated collection of tailored metrics specifically developed for Avalanche. These metrics have been designed to harness the unique characteristics and capabilities of the Avalanche blockchain, offering insights that are both profound and actionable. The following table encapsulates these metrics, organized into their respective categories, and provides a detailed overview of each metric's definition and calculation method.

Table 6.2: Table of the Avalanche-Specific metrics

Transactional Metrics

• Emitted UTXO Amounts (Sum/Average/Median)
  Definition: Total, mean, and median values of Unspent Transaction Outputs (UTXOs) created on a specified date within a blockchain subchain.
  Calculation: The total value is calculated by summing all emitted UTXOs. The average is derived by dividing this sum by the day's transactions. The median is identified by arranging all UTXO amounts and selecting the middle value.

• Consumed UTXO Amounts (Sum/Average/Median)
  Definition: Total, mean, and median values of UTXOs spent on a specified date within a blockchain subchain.
  Calculation: The total value of consumed UTXOs is summed. The average is calculated by dividing this total by the day's transactions. The median is found by arranging consumed UTXO values and selecting the middle one.

Network Health and Activity

• Total Staked Amount
  Definition: The sum of all tokens staked on a specified date within a blockchain subchain.
  Calculation: Obtained by summing the value of all tokens staked within the subchain on the designated day.

Economic Indicators

• Total Burned Amount
  Definition: The sum of all tokens intentionally destroyed or removed from circulation on a specified date within a blockchain subchain.
  Calculation: Totals the amount of tokens that have been burned or removed from circulation on the specified day.

• Network Economic Efficiency
  Definition: Measures the efficiency of the network's economy by comparing the total value transacted to the total amount of tokens burned on a specific date.
  Calculation: The ratio of the total transacted value to the total burned amount for the day.

• Staking Dynamics Index
  Definition: Evaluates the overall efficiency and appeal of the staking process on the Avalanche network, factoring in the rewards, fees, and staked amounts.
  Calculation: Assesses staking efficiency by factoring in staked amounts, estimated rewards, burned amounts, and delegation fee percentages.

• Staking Engagement Index
  Definition: Compares the estimated rewards to the total amount staked on the Avalanche network for a specified date, reflecting the attractiveness of staking rewards.
  Calculation: The ratio of total estimated rewards to the total amount staked on the network for the day.

Cross-Chain Metrics

• Interchain Liquidity Ratio
  Definition: The ratio of total value transferred between chains to the total value created on the Avalanche network on a specified date.
  Calculation: Derived by comparing the total value transferred across chains to the total value created on Avalanche for the day.

• Interchain Transactional Coherence
  Definition: The proportion of cross-chain transactions relative to the total transaction volume on the network for a specified date, focusing on interoperability and cross-chain activity.
  Calculation: The ratio of the total value of cross-chain transactions to the overall transaction volume on the network for the day.
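
The ratio-style definitions above can be stated compactly. Writing V_d, B_d, R_d, S_d, T_d, and C_d for the day-d totals of transacted value, burned amount, estimated staking rewards, staked amount, cross-chain transferred value, and value created on Avalanche respectively (notation ours):

\[
\text{Economic Efficiency}_d = \frac{V_d}{B_d}, \qquad
\text{Staking Engagement}_d = \frac{R_d}{S_d}, \qquad
\text{Interchain Liquidity}_d = \frac{T_d}{C_d}
\]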

Concluding the Metrics Recognition section, it is clear that the development and implementation of a structured framework for on-chain metric analysis significantly enhance our ability to understand and interpret the complex dynamics of blockchain networks. Through the meticulous categorization and formulation of metrics, our framework, GodSight, embodies a holistic approach to blockchain analysis, merging the rigor of scientific analysis with the adaptability required to navigate the evolving digital asset ecosystem.

By distinguishing between general metrics and those tailored specifically to the Avalanche blockchain, we underscore the versatility and depth of our analysis capabilities. The introduction of categories such as Economic Indicators and Cross-Chain Metrics further exemplifies our commitment to innovation, ensuring that GodSight remains at the forefront of on-chain analysis by addressing emerging trends and the growing need for interoperability among disparate blockchain systems.

Our framework's emphasis on customization and extensibility not only caters to the immediate analysis needs of various stakeholders, from developers to investors, but also anticipates the future demands of the blockchain community. By enabling users to define and integrate new metrics and categories, GodSight fosters a collaborative and forward-looking approach to blockchain analysis, empowering users to uncover insights that are both profound and actionable.

In conclusion, the Metrics Recognition section of our research delineates a foundational aspect of our work, laying the groundwork for advanced on-chain analysis. The structured categorization and thoughtful formulation of metrics serve as the cornerstone of our framework, enabling a nuanced exploration of blockchain networks that is both comprehensive and adaptable. As the blockchain landscape continues to evolve, so too will our framework, ensuring its relevance and utility for years to come. This adaptability, rooted in a deep understanding of blockchain metrics and their implications, positions GodSight as an indispensable tool for navigating the future of blockchain analysis.

6.1.2 SYSTEM DESIGN & ARCHITECTURE

In the rapidly evolving domain of blockchain technology, the ability to adapt and interpret extensive data efficiently is paramount. The GodSight framework, designed with this challenge in mind, offers a robust solution for on-chain analysis through its sophisticated system architecture and design. This chapter delves into the intricate components and methodologies that make up the GodSight framework, showcasing how it seamlessly integrates data extraction, computation, and extensibility. By leveraging a modular approach, the framework ensures that users can not only keep pace with current blockchain technologies but also have the capacity to incorporate future advancements. This introduction sets the stage for a detailed exploration of the system's architecture, highlighting its extendibility to cater to a diverse set of blockchain environments.

6.1.2.1 Framework Architecture

The GodSight framework is engineered as a modular assembly of components, each dedicated to handling a specific task within the ecosystem. This component-based architecture enhances maintainability, scalability, and the ease of integrating new functionalities. Below is a breakdown of the primary components:

• Extraction Component: This component is responsible for the retrieval of blockchain data. It leverages user-defined API functions to extract transaction data, which is then processed and formatted for consistency across different blockchains.

• Computation Component: Central to processing and analyzing the data, this component executes the complex algorithms and computations needed to derive insights from the extracted blockchain data.

• Metrics Controller Component: Implemented as a Django application, this API service manages metrics by fetching metric data and allowing users to define new simple metrics. It acts as a bridge between the raw data processed by the framework and the insights presented to the users (a sketch of such an endpoint appears at the end of this subsection).

• Dashboard Component: A visual interface built using React, this application displays the results of various metrics. It connects seamlessly to the Metrics Controller, providing an integrated user experience without the need for a separate service setup.

• Utils Component: A toolkit component that facilitates the addition of new blockchains into the system. It includes utilities for setting up and validating new data mappers and extraction functions, crucial for expanding the framework's capabilities.

Figure 6.1: System Architecture

Except for the Dashboard application, all components of the GodSight framework
are developed using Python, ensuring high interoperability and ease of integration with
other Python-based tools and libraries. Furthermore, through the features provided by
the Utils component, users can generate Docker images for the Extraction and
Computation components. This capability facilitates the deployment of these components
as AWS Lambda functions, allowing for scalable, cloud-based operations that can
efficiently handle varying loads and data volumes.
This architecture not only supports the operational demands of on-chain analysis but
also provides a robust, flexible platform for future expansions and enhancements.

6.1.2.2 Extendibility

The extendibility of the GodSight framework is a core feature, designed to ensure that the
system can adapt and evolve in response to new requirements and blockchain technologies.
This section describes how extendibility is implemented, the components involved, and
the processes for integrating new features.

• Database as a Tool for Extendibility: The framework utilizes a database not
only for storing transaction data but also as a central tool for facilitating
extendibility. The database's role is pivotal in managing and scaling the integration
of new blockchains, making the system both simple and reliable.

• Role of the Utils Component: The Utils Component is crucial in the extendibil-
ity process. It contains all the necessary functionalities and scripts required to add
new blockchains and metrics types to the system. This component ensures that
new integrations are both seamless and standardized.

• Required Files and Scripts: Users looking to extend the framework to support
a new blockchain must provide specific files and scripts, including:

1. Extraction Function Script: A Python script that defines how to extract data
from the new blockchain.
2. Mapper Scripts: These scripts specify how to map the raw data from the
blockchain into the general format used by the framework.
3. Metric Scripts: These scripts allow users to define the custom metrics in a
detailed manner.

• Formatting and Examples: Each script must adhere to a specific format to ensure
compatibility with the framework. For example, when integrating the Avalanche
blockchain, users would define an extraction function in Python that outputs data
in three lists: inputs, outputs, and transactions. The mapper scripts would then
format these lists to align with the predefined database schema (a minimal sketch
follows this list).

• Validation Process: The validation process involves checking the syntax and
logic of the new scripts. The Utils Component automatically tests these scripts to
ensure they execute without errors and that the data mappings correctly reflect the
general model expectations.

• Database Schema and Storage Process: Each new blockchain integrated into
the framework requires a corresponding set of database tables that adhere to a
generalized schema optimized for on-chain analysis. This schema includes fields
common to most blockchains, such as transaction IDs, dates, and values, but can
also accommodate unique features specific to each blockchain. The process of
storing data involves:

1. Formatting the extracted data using the mapper scripts.
2. Inserting the formatted data into the pre-defined database tables.
3. Using database triggers or scheduled tasks to update the tables as new data
is fetched by the Extraction Component.
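
To make the expected script shapes concrete, the following is a minimal sketch of an
extraction function of the kind described above. The endpoint URL, response fields, and
dictionary keys are illustrative assumptions, not part of the framework's actual API.

import requests

def extract(date: str):
    """Hypothetical extraction function: fetches one day of transactions
    and returns the three lists the framework expects."""
    resp = requests.get("https://api.example.com/txs", params={"date": date})
    resp.raise_for_status()
    inputs, outputs, transactions = [], [], []
    for tx in resp.json()["transactions"]:
        transactions.append({"tx_id": tx["id"], "date": date, "fee": tx["fee"]})
        inputs.extend(tx["inputs"])    # raw input UTXOs of this transaction
        outputs.extend(tx["outputs"])  # raw output UTXOs of this transaction
    return inputs, outputs, transactions

A mapper script would then reshape each of the three returned lists into the column
layout of the generalized database schema.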

By following these structured steps, the GodSight framework ensures that extendibility
is not only feasible but also practical and efficient, allowing users to continually adapt
the system to meet the ever-changing landscape of blockchain technology. This approach
provides a robust foundation for the ongoing expansion and customization of the
framework's analysis capabilities.

6.1.2.3 Customizability

The customizability within the GodSight framework empowers users to define custom
metrics tailored to their specific analysis needs. This system is designed to accommodate
the diversity of blockchain technologies by utilizing a set of features commonly found
across most blockchains. Here is how the rule-based system operates:

• Metric Definition: Users are required to define their metrics using these stan-
dardized features. Metric functions are articulated through a JSON format, the
specifics of which are detailed in the GodSight documentation. This documenta-
tion provides comprehensive guidelines and examples to assist users in formatting
their metric definitions correctly.

• Integration with Dashboard Application: The feature to define metrics is
embedded within the GodSight Dashboard application. This integration facilitates
a user-friendly interface for entering and managing custom metric definitions.

• API Interaction and Validation: Once a metric is defined, users submit it
via an API request to the Metrics Controller component. The Metrics Controller
performs two levels of validation:

1. Format Validation: Ensures that the metric definition adheres to the JSON
format as specified in the framework documentation.
2. Logic Validation: Assesses the logical structure and feasibility of the defined
metric to guarantee that it can be computed accurately and efficiently.

• Database Storage and Metric Execution: After successful validation, the
metric is stored in the database under a unique name provided by the user. The
actual computation of these metrics is not automated within the framework; instead,
users must manually trigger the Computation Component. This component then
processes the data according to the defined metrics and generates the desired outputs.

This customizability of the GodSight framework not only facilitates the customization
of data analysis to suit various blockchain types but also ensures that users can effectively
measure and interpret blockchain activities through a flexible, user-defined metric system.
The system design & architecture of the GodSight framework are foundational to its
capability to deliver precise and scalable on-chain analysis. Through its component-based
structure, the framework ensures comprehensive data handling from extraction to visu-
alization, all the while maintaining flexibility in the integration of new blockchain tech-
nologies. The system’s reliance on a customized approach for metrics definition and the
strategic use of databases for extendibility underscore its innovative design. As blockchain
technology continues to grow in complexity and application, the adaptability and robust-
ness of the GodSight framework equip users with the tools necessary for effective analysis
and decision-making.
This methodology not only outlines the technical underpinnings of the system but
also reflects on the future potential of the GodSight framework to transform blockchain
analysis through continuous refinement and expansion.

6.2 BITCOIN ANOMALY ANALYSIS

6.2.1 BABD-13: DATASET OVERVIEW

The BABD-13 dataset provides comprehensive, labeled Bitcoin transaction data, en-
abling researchers to analyze the behavior of various Bitcoin addresses. Each transaction
is tagged according to its specific nature, with the original dataset consisting of 13 dis-
tinct labels that describe the activities associated with Bitcoin addresses. These labels
range from illicit categories like “Blackmail,” “Darknet Market,” and “Money Launder-
ing” to non-illicit categories like “Cyber-Security Service,” “Centralized Exchange,” and
“Individual Wallet.”
The dataset groups labels into two primary categories: illicit and non-illicit. Illicit la-
bels represent activities such as extortion, money laundering, and other forms of financial
fraud, while non-illicit labels include various legitimate uses, such as financial services
and cryptocurrency mining pools.

Illicit Types

• Blackmail: Involves various scams where victims are coerced or tricked into paying
a certain amount of cryptocurrency to specific addresses.

• Darknet Market: Markets operating on the darknet to trade illegal items using
cryptocurrencies.

• Government Criminal Blacklist: Addresses believed to be involved in criminal
activities.

• Money Laundering: The process of trading “dirty” cryptocurrency to obscure
its origins.

• Ponzi Scheme: Fraudulent schemes that reward early investors using newer par-
ticipants’ investments.

• Tumbler: Services used to anonymize cryptocurrency transactions.

Non-Illicit Types

• Cyber-Security Service: Providers offering services like VPNs and payment gate-
ways, accepting only cryptocurrency as payment.

• Centralized Exchange: Intermediaries for trading cryptocurrency and swapping
it with fiat currency.

• P2P Financial Infrastructure Service: Financial activities conducted solely
via cryptocurrency.

• P2P Financial Service: Services that provide rewards in cryptocurrency for
completing tasks.

• Gambling: Casino-style games where cryptocurrencies are used for betting.

• Mining Pool: Groups of miners who collaborate to maximize their chances of
mining new blocks.

• Individual Wallet: Wallets used for day-to-day cryptocurrency transactions by
individuals.

Given the nature of the dataset, a significant challenge lies in the imbalance between
illicit and non-illicit categories. The majority of the dataset is dominated by non-illicit
addresses (over 93% of the data), while illicit activities like government blacklists, Ponzi
schemes, and money laundering make up only a small fraction (∼ 0.003% of the data).
Due to the low representation of some illicit types, those types were excluded from
the final selection. The finalized selected labels are:

• Blackmail - Class 1

• Darknet Market - Class 2

• Tumbler - Class 3

• Non-illicit - Class 0

6.2.2 TABULAR-BASED BITCOIN ANOMALY DATASET CREATION

In this research, the goal is to create a simple, tabular-based Bitcoin anomaly dataset,
distinct from the graph-based BABD-13 dataset. The tabular format simplifies the rep-
resentation of Bitcoin transaction data, making it more accessible to on-chain analysis
where transactional data is typically maintained in tables. By using a statistical approach
to feature extraction, patterns can be discerned directly from transactional attributes,
facilitating anomaly detection across different classes.

6.2.2.1 Tabular-Based Dataset Approach

Instead of relying on graph-based modeling, the proposed dataset is structured in a
tabular form where each row represents features related to a specific Bitcoin address.
This approach provides a concise view of transaction data by summarizing key attributes
using various statistical measures. The simplicity of this format ensures that anomaly
detection models can efficiently process and identify unusual behavior patterns.

6.2.2.2 Statistical Feature Extraction

To populate the dataset, statistical features are extracted for each address based on
transaction data. The extracted features include standard measures like:

• Mean: The average transaction value or time difference.

• Median: The middle value of the transaction data.

• Count: The total number of transactions.

• Quantiles: Statistical values at specific percentiles (e.g., 25th, 75th).

• Skewness and Kurtosis: Measures of asymmetry and peakedness, respectively,
of the transaction value distribution.
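
As an illustration of this step, the sketch below computes several of these statistics
per address with pandas; the input column names (address, value_usd) are assumptions
made for the example rather than the dataset's actual schema.

import pandas as pd

def address_features(txs: pd.DataFrame) -> pd.DataFrame:
    """Summarize transaction values per address; txs is assumed to hold
    one row per (address, transaction) pair."""
    g = txs.groupby("address")["value_usd"]
    return pd.DataFrame({
        "value_usd_mean": g.mean(),
        "value_usd_median": g.median(),
        "tx_count": g.count(),
        "value_usd_25th_percentile": g.quantile(0.25),
        "value_usd_75th_percentile": g.quantile(0.75),
        "value_usd_skew": g.skew(),
        "value_usd_kurtosis": g.apply(pd.Series.kurt),
    })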

6.2.2.3 Importance of Transaction Time and Value

Transaction time and value are critical components when analyzing Bitcoin transaction
data. Temporal analysis reveals patterns such as frequent transactions within short inter-
vals, possibly indicative of automated or suspicious activity. Transaction value analysis
highlights deviations that could signal illicit behavior, like sudden, large transfers.

6.2.2.4 Differentiation Between Input and Output Transaction Features

Bitcoin addresses can be involved in transactions either as an input (sending
cryptocurrency) or as an output (receiving cryptocurrency). The dataset differentiates features
based on this classification, providing a comprehensive view of an address’s transaction
history. Input-specific features may include metrics such as the total number of inputs
or the average input value. Similarly, output-specific features cover metrics like the
number of outputs and the average output value.

6.2.2.5 Cross-Featuring Between Input and Output Transactions

In addition to separate input and output features, cross-featuring involves combining both
aspects to create new feature sets. For instance, features like the ratio of input to output
value or the time difference between consecutive transactions regardless of direction can
offer deeper insights into transactional behavior.
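
For illustration (with placeholder column names), such cross-features can be derived
directly from previously computed per-address input and output summaries:

# feats is assumed to hold the per-address input/output statistics.
feats["input_output_usd_mean_ratio"] = (
    feats["input_value_usd_mean"] / feats["output_value_usd_mean"]
)
feats["input_output_usd_max_ratio"] = (
    feats["input_value_usd_max"] / feats["output_value_usd_max"]
)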

6.2.2.6 Overview of Newly Created Features

This methodology yields a total of 76 features, each contributing unique insights into the
behavior patterns of Bitcoin addresses. The features cover a broad range of statistics:

• Statistical measures of input and output transaction values and time.

• Aggregated metrics such as the total transaction count or cumulative value.

• Ratios and combinations of input/output metrics.

This new dataset allows machine learning models to differentiate between different
Bitcoin address classes (illicit vs. non-illicit) with statistical precision, providing a prac-
tical alternative to graph-based anomaly detection approaches.

6.2.3 MACHINE LEARNING WORKFLOW

The machine learning workflow involves several crucial steps to transform the raw Bitcoin
transaction data into an optimized, balanced dataset ready for model training. This
ensures that the subsequent classification models are well-prepared to accurately detect
Bitcoin anomalies.

6.2.3.1 Data Cleaning

The first step is to clean the dataset by removing features with excessive null values.
Columns where a significant proportion of data is missing can lead to bias or inaccuracies
during analysis. By dropping these columns, the dataset becomes more reliable and
representative of relevant features. Any remaining null values are either imputed or
handled based on the specific needs of the analysis.
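
A minimal pandas sketch of this cleaning step is shown below; the 50% null threshold
is an assumption for illustration, not the exact cut-off used in the experiments.

# Drop columns whose fraction of missing values exceeds the threshold.
null_frac = df.isnull().mean()
df = df.drop(columns=null_frac[null_frac > 0.5].index)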

6.2.3.2 Feature Selection

Feature selection is crucial to reducing overfitting and improving model performance.
Two primary techniques are used:

• Recursive Feature Elimination (RFE): This method recursively eliminates fea-
tures by training models with subsets of features and ranking them by importance.
The goal is to retain only the most relevant features.

• Random Forest Feature Importance: Random Forest classifiers provide a
measure of feature importance based on how effectively they split the dataset. Features
with higher importance scores significantly impact the final classification outcome
and are prioritized.
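
Both techniques are available in scikit-learn. The sketch below shows how they might
be applied, where X and y stand for the engineered feature matrix and class labels; the
selection sizes are illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rf = RandomForestClassifier(n_estimators=200, random_state=42)

# RFE: repeatedly fit the estimator and discard the least important features.
selector = RFE(estimator=rf, n_features_to_select=30, step=5).fit(X, y)
selected_columns = X.columns[selector.support_]

# Random Forest importances, used to rank features directly.
importances = rf.fit(X, y).feature_importances_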

6.2.3.3 Handling Data Imbalance

Due to the overwhelming number of non-illicit transaction records in the dataset, han-
dling data imbalance becomes critical to ensure the model isn’t biased. The following
oversampling techniques are used:

• Synthetic Minority Oversampling Technique (SMOTE): This technique
generates synthetic samples for the underrepresented illicit classes by interpolating
between nearest-neighbor minority samples, effectively increasing the sample size.

• Random Oversampling: Duplicates records of underrepresented classes to
balance class proportions, ensuring the model is exposed to enough samples from all
classes.
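
Both techniques are implemented in the imbalanced-learn library; a minimal sketch,
where X_train and y_train denote the imbalanced training split:

from imblearn.over_sampling import SMOTE, RandomOverSampler

# SMOTE: synthesize new minority-class samples by interpolation.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Random oversampling: duplicate existing minority-class rows.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)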

6.2.3.4 Data Standardization

To ensure that all features are on a comparable scale and that none disproportionately
influence the models, standardization is crucial. Each feature is normalized to have a
mean of zero and a standard deviation of one. This makes training more consistent and
ensures that the model can interpret the influence of each feature accurately.

6.2.3.5 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is employed to reduce the dimensionality of the
dataset. It achieves this by transforming the original feature space into a set of principal
components that capture the most significant variance in the data. By selecting only the
principal components that account for a large proportion of variance, the dataset becomes
more concise, focusing on key patterns while reducing computational requirements and
minimizing redundant information.
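
The standardization and PCA steps compose naturally into a scikit-learn pipeline,
sketched below; the component count of 30 anticipates the experiments reported in
Section 7.2.3.

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize to zero mean and unit variance, then project onto the
# principal components that capture the most variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=30))
X_reduced = pipeline.fit_transform(X_train)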

6.2.4 MODEL TRAINING AND ARCHITECTURE

In this phase of the methodology, different models are explored to identify the most suit-
able architecture for Bitcoin anomaly detection, given the nature of the tabular dataset
and the specific features extracted.

6.2.4.1 Tree-Based Models and Tabular NN Models

Tree-based models and tabular neural networks (NNs) have proven highly effective in
classification tasks involving structured data.

• Tree-Based Models: Decision tree models, particularly Random Forests and Gra-
dient Boosting Machines, are well-suited for handling tabular datasets. They can
identify complex patterns and interactions between features, offering insight into
the importance of each feature for classification.

• Tabular NN Models: Tabular NN models, such as TabNet, GATE, and NODE,
are specialized neural networks designed to work directly with tabular data. They
combine the advantages of deep learning, such as feature extraction and high-dimensional
representation, with the interpretability and structured nature of tree-based models.

6.2.4.2 Modified TabNet Model Architecture

Figure 6.2: Modified TabNet Architecture

In this study, a modified TabNet model is used to exploit its feature selection and
classification capabilities. The TabNet architecture combines:

• Feature Selection Component: Uses attention mechanisms to select and
prioritize important features.

• Classifier Component: Processes the selected features and outputs classification
probabilities.

The attention mechanism ensures that only relevant features are processed, reducing
unnecessary computation and enhancing the model’s interpretability.

6.2.4.3 Introduction of 1D CNN Feature Extraction

To enhance feature extraction, a 1D Convolutional Neural Network (CNN) is added
to the TabNet architecture. The 1D CNN layer processes tabular data sequentially,
allowing for the automatic extraction of complex patterns. This step improves
performance by revealing relationships that aren't directly apparent in individual features.
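
A minimal PyTorch sketch of such a front-end is given below. It is an interpretation of
the described design rather than the exact architecture used: the filter count and kernel
size are assumptions, and the flattened output would be passed on to the TabNet encoder.

import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    """Treats the tabular feature vector as a one-channel sequence and
    learns local patterns across adjacent features."""
    def __init__(self, n_features: int, n_filters: int = 16, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size, padding=kernel_size // 2)
        self.act = nn.PReLU()  # learnable negative slope, per the modified design

    def forward(self, x):              # x: (batch, n_features)
        z = x.unsqueeze(1)             # -> (batch, 1, n_features)
        z = self.act(self.conv(z))     # -> (batch, n_filters, n_features)
        return z.flatten(1)            # -> (batch, n_filters * n_features)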

6.2.4.4 Feature Selection and Classification Components (Attention Mechanism and Classifier)

• Feature Selection: The feature selection component uses an attention mechanism
to prioritize features that most influence the classification outcome. It ensures that
the downstream classifier only processes the most relevant attributes.

• Classifier: The classifier receives these important features and processes them to
predict the class probabilities. The modular design allows efficient adjustments and
training to improve classification accuracy.

Modifications to Activation Functions


Other modifications were made to the TabNet architecture to improve performance:

• ReLU to PReLU Activation Function: The activation function was changed
from ReLU (Rectified Linear Unit) to PReLU (Parametric Rectified Linear Unit).
PReLU has a learnable parameter that allows for negative activation values, helping
reduce inactive neural nodes and improving gradient flow.

• Additional Layers: Additional layers were integrated into the model to refine
feature selection and classification.

These architectural adjustments improve the model’s ability to detect anomalies across
the highly imbalanced classes in the Bitcoin transaction dataset.

6.2.5 EVALUATION METRICS

To assess the performance of the anomaly detection models, it’s important to use metrics
that effectively capture their ability to handle the extreme class imbalance. The evalua-
tion process involves a highly imbalanced test set, where approximately 99% of the data
consists of non-illicit addresses and only 1% comprises illicit addresses.

6.2.5.1 Imbalanced Test Set Description

The test set is intentionally imbalanced to represent real-world distributions in Bitcoin
transactions. Most transactions are legitimate, with only a tiny fraction representing
illicit activity. This distribution allows for a more realistic evaluation of the model's
ability to detect the minority illicit classes.

6.2.5.2 Metrics

The following metrics are used to measure model performance:

• Overall Accuracy: Overall accuracy measures the proportion of correctly
classified instances out of the total instances in the test set.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where:

– TP (True Positives): Correctly classified illicit addresses.
– TN (True Negatives): Correctly classified non-illicit addresses.
– FP (False Positives): Non-illicit addresses misclassified as illicit.
– FN (False Negatives): Illicit addresses misclassified as non-illicit.

• Class-Wise Accuracy: Class-wise accuracy, also known as per-class accuracy,
measures accuracy separately for each class to determine how well the model
distinguishes between different types.

$$\text{Class-wise Accuracy} = \frac{\text{Correct Predictions for Class}}{\text{Total Instances of Class}}$$

• F1 Score: The F1 score combines precision and recall into a single metric,
providing a balanced measure for classification performance, especially in imbalanced
datasets.

$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where:

– Precision: The proportion of correct illicit classifications out of all predicted
illicit instances.

$$\text{Precision} = \frac{TP}{TP + FP}$$

– Recall: The proportion of illicit instances that were correctly identified out
of all true illicit cases.

$$\text{Recall} = \frac{TP}{TP + FN}$$
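
These metrics map directly onto standard scikit-learn calls; a sketch of computing
them on the predictions, where y_true and y_pred are NumPy arrays of the four-class
labels (0 = non-illicit):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

overall_acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Class-wise accuracy: correct predictions within each true class.
class_acc = {c: accuracy_score(y_true[y_true == c], y_pred[y_true == c])
             for c in np.unique(y_true)}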

Chapter 7
RESULTS AND DISCUSSION

7.1 EXTENDABLE OPEN-SOURCE ON-CHAIN FRAMEWORK

7.1.1 FRAMEWORK PERFORMANCE AND EFFECTIVENESS

The GodSight framework exhibits notable efficiency and reliability in handling the extrac-
tion, computation, and visualization of on-chain data, offering a comprehensive platform
for blockchain analysis.

7.1.1.1 Speed of Data Processing and Accuracy of Data Extraction

GodSight simplifies data processing by mapping all extracted transaction data into a uni-
fied format using a set of general features. This streamlined structure allows the extracted
data to be maintained in a single database table, while categorizing the information into
three key types: inputs, outputs, and transactions. This categorization ensures rich, de-
tailed data is available for on-chain analysis, enabling the framework to support a diverse
array of metrics.
The accuracy of extracted data depends on the APIs used to obtain transaction in-
formation. By default, the framework relies on open-source APIs, which are convenient
and sufficient for general purposes. However, for those seeking higher precision, users
can extract transaction data directly from a blockchain node. This flexibility allows
users to fine-tune their data sources according to their requirements, trading off between
convenience and accuracy.

7.1.1.2 Accuracy in Computation

The computation accuracy within GodSight is directly linked to the quality of the ex-
tracted transaction data and the computation logic used. The framework provides a well-
documented set of equations for each metric and adheres to the general features model
to maintain consistent calculations. However, since the user is responsible for mapping
features during the blockchain integration process, data accuracy can be influenced by
how well this mapping is executed. Additionally, as an open-source tool, users have the
liberty to refine or update the logic of the metrics to align with evolving requirements or
to address issues.

7.1.1.3 Responsiveness of the Dashboard

The GodSight dashboard leverages React.js to deliver a responsive, interactive
visualization pane for on-chain metric data. Charts and graphs provide real-time insights, offering
users a comprehensive view of blockchain activity. The dashboard also includes filters
that allow users to interact with the data dynamically, exploring different perspectives
and gaining a deeper understanding of their metrics. This interactive interface ensures
that users can effortlessly analyze their data and customize their analyses.

7.1.2 EXTENSIBILITY AND CUSTOMIZATION

The GodSight framework is designed with a strong emphasis on extensibility, allowing
users to seamlessly add support for new blockchains and customize their analysis. This
is primarily achieved through the Utils component, which provides a flexible foundation
for defining key files that ensure consistent integration and mapping of new blockchains.

7.1.2.1 Framework Support Through Utils Component

meta.json: The meta.json file is the starting point for adding any new blockchain to the
framework. It includes critical metadata such as the blockchain name, start date (useful
for historical analysis), subchain names, and basic metrics available for each subchain.
This ensures that whether integrating a multi-chain blockchain like Avalanche or a single-
chain blockchain like Bitcoin, the framework accommodates the different structural and
data nuances. For instance, Avalanche’s subchains (X, P, and C) are clearly distinguished
in the meta.json file, while Bitcoin, a single-chain blockchain, defaults to one subchain
called ’default.’
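For illustration, a hypothetical meta.json for a multi-chain network might look as
follows; the field names are assumptions based on the description above, not the
framework's verbatim schema.

{
    "blockchain": "Avalanche",
    "start_date": "2020-09-21",
    "subchains": ["x", "p", "c"],
    "metrics": ["transaction_count", "total_fees"]
}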
Extract File: The Extract file contains the logic for fetching transaction data from
the source. This ensures users can define custom extraction functions that align with their
preferred data source. For Avalanche, open-source APIs are used to collect transaction
data. By standardizing the input (a specific date) and output (lists of inputs, outputs,
and transactions), the Extract file maintains consistency in data retrieval, making it
easier for users to integrate new blockchains into the framework.
Mapper File: Since different blockchains have varying feature names and structures,
the Mapper file enables mapping from each blockchain’s native features to a general model
format. This ensures that all extracted data is consistent and usable within the frame-
work. For instance, users can directly map or derive feature values from the extracted
data. By providing this mapping flexibility, GodSight can handle the data complexity
across various blockchains.
Metric File: In addition to the meta.json, Extract, and Mapper files, the Metric file
allows users to define custom metrics tailored to their specific on-chain analysis needs.
These custom metrics leverage the general feature set established through the mapping
process, ensuring consistency and compatibility across various blockchains.

import pandas as pd


class CustomMetric:
    def __init__(self, blockchain, chain, name, display_name,
                 transaction_type, category, description):
        self.blockchain = blockchain
        self.chain = chain
        self.name = name
        self.display_name = display_name
        # Options: "transaction", "emitted_utxo", "consumed_utxo"
        self.transaction_type = transaction_type
        self.category = category
        self.description = description

    def calculate(self, data: pd.DataFrame) -> float:
        """
        Override this method to define the metric calculation.

        :param data: A pandas DataFrame containing the blockchain data
                     relevant to this metric.
        :return: The calculated metric value.
        """
        raise NotImplementedError("This method should be overridden by subclasses.")

7.1.2.2 Case Studies of Successful Integration

Avalanche: GodSight successfully integrates Avalanche, a multi-chain blockchain, by
creating separate subchains for X, P, and C in the meta.json file and extracting transaction
data via open-source APIs. The Mapper file ensures that unique Avalanche data fields are
translated into the framework’s general model. This mapping has allowed for accurate
computation of metrics such as transaction volume and total staked amount. Custom
metrics like 'Total Staked Amount' give a specific economic indicator reflecting the
blockchain's performance.
Bitcoin: As a single-chain blockchain, Bitcoin’s data extraction is configured to re-
turn transaction data in the ’default’ subchain. The Mapper file simplifies this integration,
ensuring Bitcoin’s native feature set aligns with GodSight’s general model.

7.1.3 USER EXPERIENCE AND INTERFACE USABILITY

The GodSight dashboard is designed with user experience in mind, providing a compre-
hensive and intuitive interface for analyzing on-chain data. The dashboard achieves this
by offering a clean, responsive layout that allows users to quickly access and interpret
metrics.

7.1.3.1 Usability and Presentation

Figure 7.1: Metric Chart with filters

The dashboard uses React.js to deliver dynamic, highly interactive visualizations. The
charts and graphs effectively present metrics across different blockchains and subchains,
allowing for an intuitive exploration of blockchain activity. Users can apply filters to
customize their views, making it easy to focus on specific metrics or periods.

7.1.3.2 Target Users

The target users of the GodSight framework typically include blockchain researchers,
analysts, and developers who have a solid understanding of blockchain structures and
features. These users should also possess basic coding skills to efficiently work with the
framework. The design of the framework aligns with these requirements, ensuring that the
core functionality is both comprehensive and accessible to those who meet the knowledge
prerequisites.
Blockchain Knowledge Requirement: Given the specialized nature of on-chain
analysis, the framework provides advanced tools that require users to have a thorough un-
derstanding of blockchain networks, transaction structures, and subchain features. This
understanding helps users accurately map and interpret data, particularly when adding
new blockchains to the framework or defining custom metrics.
Coding Skills: To work effectively with the framework, users need a basic grasp of
Python programming. This skill is necessary for creating complex metrics through the
code-level add-blockchain option and for modifying the computation logic as needed. For
simple metrics, however, the JSON-based API makes the process straightforward.

7.1.3.3 Ease of Creating Simple Metrics through the API

Figure 7.2: Creating Simple Metrics from the Dashboard

Creating simple metrics through the API is a powerful feature of the framework, con-
tributing to user engagement and productivity. Users can define metrics by providing a
JSON-based formula in one of two supported formats:

1. Format 1: This simpler format allows users to define metrics through basic ag-
gregation functions like sum, as seen in the "total_fees" example. The aggregation
functions are straightforward and provide an easy entry point for defining metrics.
Users specify the blockchain, subchain, aggregation column, and the aggregation
function itself.
formula = {
    "aggregations": [
        {
            "name": "total_fees",
            "column": "fee",
            "function": "sum"
        }
    ],
    "final_answer": "total_fees"
}
print(formula)

2. Format 2: This format offers greater complexity for more nuanced metric calcu-
lations. Users can specify multiple aggregations, arithmetic operations, and condi-
tions on the data, such as filtering by transaction type. For instance, in "Normalized
Adjusted Send Fee Impact," aggregations include functions like sum and avg, while
arithmetic operations like subtraction and division are applied to intermediate results.
formula = {
    "aggregations": [
        {"name": "sum_amount_send", "column": "amount", "function": "sum",
         "condition": {"column": "tx_type", "value": "send"}},
        {"name": "avg_fee_send", "column": "fee", "function": "avg",
         "condition": {"column": "tx_type", "value": "send"}},
        {"name": "count_receive", "column": "tx_type", "function": "count",
         "condition": {"column": "tx_type", "value": "receive"}},
        {"name": "min_fee", "column": "fee", "function": "min"}
    ],
    "arithmetic": [
        {
            "name": "adjusted_avg_fee",
            "operation": "subtraction",
            "operands": ["avg_fee_send", "min_fee"]
        },
        {
            "name": "final_metric",
            "operation": "division",
            "operands": ["adjusted_avg_fee", "count_receive"]
        }
    ],
    "final_answer": "final_metric"
}

These API-based metrics are suitable for relatively simple calculations and help users
quickly define new metrics without needing to modify the framework codebase. If more
sophisticated metrics are required, the user can still opt for the add-blockchain option
and define metrics at the code level.

7.1.4 COMPARATIVE ANALYSIS

In comparing GodSight with existing on-chain analysis frameworks like Nansen, Dune
Analytics, Glassnode, and IntoTheBlock, our solution stands out for its unique
features and improvements. These established platforms offer comprehensive on-chain
data, market analytics, and wallet tracking, but they are closed-source. This ’black box’
approach restricts transparency and openness, contrary to the core blockchain ethos.
Furthermore, their extendibility is limited, as integrating a new blockchain can lead to
significant delays, leaving users awaiting desired chains.
GodSight addresses these issues through an open-source, modular approach. Our Vi-
sualization Pane allows users to interact with and analyze blockchain data while accessing
the underlying code that powers these insights. This transparency empowers developers
to customize and enhance the framework to suit their needs. Its modular structure en-
sures that integrating new blockchain technologies is straightforward and user-friendly,
enabling rapid adoption. Users can seamlessly add custom metrics and refine existing
ones, creating a personalized analysis environment.
However, data extraction remains a challenging area. In GodSight, we faced chal-
lenges, particularly with the Glacier API, which has strict rate limits that prevented
parallel data extraction. This challenge required careful scheduling and optimization
within the Data Pane to align with API limitations while still delivering accurate and
consistent results. To address such limitations in the future, we will continue improv-
ing data extraction workflows, integrating multiple strategies like instrumenting nodes
directly or utilizing new APIs to enhance data completeness and accuracy.

7.1.5 FUTURE DIRECTIONS AND IMPROVEMENTS

The GodSight framework, while offering a solid foundation for extensibility and cus-
tomization, has ample potential for growth and enhancement. Several opportunities lie
ahead to bolster the framework’s technical capabilities, expand its feature set, and provide
broader support for blockchain analysis.

7.1.5.1 Technical Improvements

Optimized Extraction and Computation: Current extraction and computation
processes could be further optimized to reduce memory and time consumption. Techniques
such as data caching, parallel processing, and improved algorithms can enhance per-
formance, particularly when processing large volumes of transaction data. Streamlin-
ing these processes will also make the framework more scalable, accommodating new
blockchains with varied data sizes and structures.
Database Schema Refinement: Revising the database schema to accommodate
more efficient indexing and querying can also help improve computation speed. This re-
finement will facilitate faster metric calculations, especially for complex metrics requiring
multi-level data aggregation.

7.1.5.2 New Features and Functionalities

Framework Commands for Metadata and Metrics Updates: Adding command-line
utilities for updating blockchain metadata and metrics directly through framework
commands will offer users a more efficient way to keep their analysis up to date. These
utilities can streamline the integration process by automating the validation and testing
phases required when adding new blockchains.
Predictive Metrics via Machine Learning Models: With transaction data
stored in a structured, tabular format, there’s potential to integrate machine learning
models for predictive analysis. By applying these models, users can derive new,
forward-looking metrics that forecast network activity, economic trends, and potential market
anomalies. This capability would be particularly valuable for researchers and developers
looking to anticipate changes in blockchain ecosystems.

7.1.5.3 Broader Blockchain Support

Expanding Blockchain Integrations: While GodSight already supports Bitcoin and
Avalanche, adding support for more blockchains will further increase the framework's
reach and utility. Expanding integrations to include other prominent networks such as
Ethereum or Solana, as well as emerging chains, will make the tool more versatile and
attractive to a broader user base.
Cross-Blockchain Metrics: Developing cross-blockchain metrics will enable com-
parative analyses, helping users evaluate blockchain ecosystems relative to one another.
These comparisons can yield insights into network performance, user behavior, and the
economic health of different blockchain platforms.

7.2 ILLICIT BITCOIN ADDRESS DETECTION

7.2.1 CREATED DATASET DETAILS

The finalized dataset comprises Bitcoin addresses labeled according to four specific classes,
with a dominant proportion of non-illicit transactions:

• Class 0: Non-illicit addresses (over 90% of the data)

• Class 1: Blackmail

• Class 2: Darknet Market

• Class 3: Tumbler

The dataset contains transaction data from 2018 to 2021, sourced from the BABD-13
dataset. Most addresses are involved in only a few transactions (typically 1 or 2), which
created challenges in computing statistical features like skewness, kurtosis, and variance.
These features require a minimum number of data points to produce meaningful values,
so those columns were removed due to high null values. The resulting dataset contains
61 features, along with a label column.

Figure 7.3: Null data description

Class Record Distribution after Cleaning

• Non-Illicit (Class 0): Approximately 57,000 Bitcoin addresses

• Blackmail (Class 1): Approximately 9,000 records

• Darknet Market (Class 2): Approximately 9,000 records

• Tumbler (Class 3): Approximately 11,000 records

To evaluate model performance, the dataset was split into training and test sets. The
test dataset was designed to mimic real-world conditions, where the vast majority of
transactions are non-illicit. Thus, it comprises 99% non-illicit data and only 1% from the
anomaly classes. Specifically, the test dataset includes:

• Non-Illicit: 17,500 records

• Blackmail (Class 1): 100 records

• Darknet Market (Class 2): 100 records

• Tumbler (Class 3): 100 records

The training dataset also reflects the imbalance present in the original data.
This division ensures the models are evaluated on a challenging dataset with a realistic
imbalance between illicit and non-illicit transactions, enabling practical assessments of
model effectiveness in detecting anomalies.

7.2.2 FEATURE SELECTION

Feature selection plays a vital role in identifying the most informative attributes for
anomaly detection. By reducing the feature space to the most relevant columns, models
can learn more efficiently and yield better predictions. In this research, tree-based models
like Random Forest were used to rank feature importance for both multi-class (four
classes) and binary (illicit vs. non-illicit) classification tasks.

Top 10 Features in Multi-Class Dataset

For the multi-class dataset, the following features emerged as the most important:

1. input_spending_value_usd_75th_percentile

2. output_value_usd_median

3. output_value_usd_25th_percentile

4. output_value_usd_minimum

5. input_spending_value_usd_25th_percentile

6. output_value_usd_75th_percentile

7. output_value_usd_maximum

8. input_spending_value_usd_median

9. input_spending_value_usd_minimum

10. output_value_minimum

These features encompass key statistical measures such as percentiles and medians for
transaction values in both input and output transactions. Percentile-based values help
distinguish subtle differences in transaction behaviors across the different classes.

Top 10 Features in Binary Dataset (Illicit vs. Non-Illicit)

When grouping all illicit classes together and comparing them against non-illicit trans-
actions, these ten features stood out:

1. input_spending_value_usd_75th_percentile

2. output_value_usd_median

3. output_value_usd_25th_percentile

4. input_spending_value_usd_25th_percentile

5. input_spending_value_usd_median

6. output_value_usd_75th_percentile

7. input_output_usd_max_ratio

8. input_output_usd_min_ratio

9. input_output_usd_mean_ratio

10. output_value_usd_maximum

The ratios between input and output values offer valuable insights into patterns that
distinguish illicit behavior, while the percentile and median-based features provide crucial
statistical measurements.
In both classification approaches, input and output transaction values, especially at
various percentiles, play a significant role in identifying illicit behaviors. The presence
of input-output ratios among the top features in the binary dataset highlights their ef-
fectiveness in distinguishing between illicit and non-illicit addresses. Overall, focusing
on these top features allows machine learning models to detect patterns indicative of
different types of behavior, simplifying the complex task of identifying Bitcoin anomalies.

7.2.3 DATA PRE-PROCESSING

1. Variance Thresholding: The first step in data pre-processing involved applying
variance thresholding to eliminate features with low variance. Features with little
variance across the dataset contribute minimally to differentiating between classes
and can introduce noise into the model. By removing these features, the data was
reduced to more informative attributes, ensuring better model performance and
reducing computational complexity.

2. Data Normalization: Various normalization techniques were explored to scale
the data to a standard range and improve the performance of machine learning
algorithms:

• Min-Max Scaling: Rescales each feature to a range between 0 and 1.

• Quantile Transformer: Transforms data to follow a uniform or normal distribution.

Quantile Transformer provided superior performance in the context of this dataset
by effectively handling skewed distributions and outliers. It ensured that each feature
had a similar distribution and range, which helped the models discern patterns
more accurately.

3. Principal Component Analysis (PCA): PCA was then applied for dimensionality
reduction. The primary reasons for using PCA include:

• Noise Reduction: By projecting data onto fewer dimensions, PCA filters out
the noise inherent in high-dimensional data.
• Efficiency: Reducing the number of features speeds up model training and
simplifies the model’s complexity.
• Multicollinearity Mitigation: PCA orthogonally transforms the features, re-
ducing multicollinearity between them.

Different numbers of PCA components were tested (10, 15, 20, and 30). After
experimentation, using 30 components proved to be the most effective, as it retained
enough variance to capture the essential structure of the dataset while reducing
dimensionality significantly.

7.2.4 MODEL TRAINING AND RESULTS ANALYSIS

The training phase encompassed a variety of models, including tree-based methods like
Random Forest and XGBoost, alongside tabular neural networks like TabNet, GATE,
NODE, and a modified version of TabNet. Each model was evaluated on the same
imbalanced test dataset to ensure consistency and comparability.

7.2.4.1 Results Overview

After evaluating each model, the following results were obtained:

Model              Total      Anomaly Class 1   Anomaly Class 2   Anomaly Class 3   Weighted
                   Accuracy   Accuracy          Accuracy          Accuracy          F1 Score
GATE               0.575      0.64              0.75              0.83              0.71
NODE               0.559      0.59              0.74              0.65              0.70
Random Forest      0.691      0.60              0.71              0.57              0.70
TabNet             0.569      0.66              0.70              0.66              0.71
Modified TabNet    0.712      0.60              0.74              0.66              0.74

Table 7.1: Model Comparison

7.2.4.2 Analysis and Interpretation

The results reveal interesting patterns and trends among the models:

• Overall Performance: The modified TabNet model achieved the highest total
accuracy (0.712) compared to all other models. This improvement underscores the
effectiveness of the modifications, particularly the addition of a 1D CNN feature
extraction layer and switching the activation function to PReLU. These enhance-
ments helped the model distinguish between the nuanced patterns of non-illicit and
illicit addresses, leading to improved classification performance.

• Class-wise Accuracy: Across different anomaly classes, most models performed
consistently, indicating their ability to identify patterns across different illicit classes.
However, the Random Forest and modified TabNet models excelled in overall accu-
racy due to their improved ability to classify non-illicit transactions more accurately.

• Comparing TabNet Models: The original TabNet model exhibited slightly lower
performance than the modified version. By incorporating the 1D CNN feature
extraction layer and the more adaptive PReLU activation function, the modified
TabNet model benefited from a more efficient feature selection process and greater
discriminatory power. This allowed the model to reduce the gap between illicit and
non-illicit classifications, leading to better overall accuracy.

The improved performance of the modified TabNet model demonstrates the value
of tailored model architectures for detecting Bitcoin anomalies. The model effectively
captured important patterns across all classes, particularly in the non-illicit class, which
made up the vast majority of the test set. This improvement supports the hypothesis
that enhanced feature extraction and activation functions can significantly contribute to
better classification outcomes in tabular neural networks.

Chapter 8
CONCLUSION

This research focused on two key developments: the GodSight on-chain analysis frame-
work and a Bitcoin anomaly detection model, both designed to advance blockchain data
analysis through comprehensive insights and anomaly detection. GodSight, an open-
source and modular framework, provides efficient data extraction and analysis with ac-
curate, real-time metrics through a responsive dashboard. Its user-defined metric cus-
tomization and multi-blockchain support offer unparalleled adaptability. Compared to
existing platforms like Nansen and Glassnode, GodSight stands out for its versatility,
responsiveness, and efficient data extraction.
The Bitcoin anomaly detection model leveraged a newly created tabular-based dataset
built upon statistical features derived from transactional data. This dataset, which simpli-
fies the representation of Bitcoin address behaviors compared to the graph-based BABD-
13 dataset, includes 76 features like medians, quantiles, and ratios. By differentiating
transactional patterns between input and output activities, it offers an alternative ap-
proach to uncovering behavioral patterns. The modified TabNet model, incorporating a
1D CNN layer and PReLU activation, achieved high accuracy in detecting illicit activi-
ties, highlighting the potential of this tabular dataset in distinguishing between legitimate
and suspicious Bitcoin addresses.
Moving forward, the optimization of GodSight’s data extraction processes and schema
structure will further improve its analysis capabilities. The integration of predictive anal-
ysis metrics via machine learning models will enhance GodSight’s insights. Expanding
cross-chain metrics, refining specialized neural network architectures, and improving pre-
dictive analysis will ensure the framework remains a robust tool for comprehensive on-
chain analysis. These advancements will position GodSight as a transformative solution
that not only identifies anomalies with precision but also delivers actionable intelligence
for blockchain data analysis.

