Write Up for Models Predicting Sybil Scores of Wallets

What a great thread this is. Thanks to all contributors. I don’t have a writeup, but want to bring your attention to a new app available in the bot-proofing and sybil-detection space: CUBID Protocol which takes a lot of inspiration from the old Gitcoin Passport and introduces a several key improvements:

  • email signup and login, as opposed to metamask
  • ability to add identity proofs from multiple chains
  • app-scoped identities, preventing user tracking across apps or networks
  • optional embedded app-scoped EVM accounts creation (abstracted wallets)
  • SDK with embedded UI componentry (no need to send users to external site for identity management)
  • user-controlled full or partial sharing of their identity, or only sharing the score
  • and more

Feel free to try it out. Embedding CUBID into your app is just a matter of a few lines of code. Also, while it already works: we are still very much in development and looking for devs to help out.

yes it’s expected the AUC score as the term implies Area under the curve basically measures how correctly your model was able to distinguise between sybil and non-sybil for this to happen you need to have the correct labelled dataset, the AUC score you got while training as such won’t be the same with the test beause it’s a new set of addresses and only the hosts of the competition currectly have the completely labelled dataset against which to check how correctly your model was able top predict, so when they check it, they’ll let you know how well your model’s prediction on the test address was able to correctly labell sybil vs non sybil
don’t know if that was clear enough

yeah, i totally understand you, i was actually refering to the fact that the train and test features after data preparation have diffrent distributions

Sybil Detection Model – Competition Writeup

Pond AI Platform name: Limonada

In this competition, I built a machine learning pipeline to detect Sybil wallets using behavioral features extracted from Ethereum and Base chain data.

Data Overview

  • Addresses: From Ethereum and Base chain data.

  • Transactions: Regular transactions, token transfers, and DEX swaps.

  • I merged and standardized data from both chains to enable consistent feature generation.

Feature Engineering

I created features through different groupings and aggregations over the transactional data. These behavioral metrics helped highlight subtle differences between Sybil and legitimate users.

Handling Class Imbalance

Since the dataset was imbalanced, I experimented with several undersampling techniques to improve learning:

  • NearMiss (various versions)

  • TomekLinks

These approaches helped in exploring different class balancing strategies.

Model & Training

To address the classification challenge, I focused on building a robust stacked ensemble model. My strategy was to combine several diverse base models and use a meta-model to learn how to best integrate their predictions.

Modeling Approach

Base Models Used:

  • Random Forest (with tuned hyperparameters)

  • XGBoost

  • LightGBM

  • Logistic Regression

  • SVC (with probability outputs)

  • Multi-Layer Perceptron (MLP)

Each base model was trained using 5-fold Stratified Cross-Validation, generating:

  • Out-of-fold (OOF) predictions for the training set

  • Averaged predictions for the test set

Stacking Ensemble:

I fed the OOF predictions from all base models into a second-layer model (meta-model). After experimenting with different options, I selected the best performers based on cross-validated AUC.

Meta-Models Used:

  • RidgeClassifierCV

  • XGBoost

Evaluation Metric:

  • ROC AUC was used throughout to measure performance and guide model selection.

Results

The final ensemble achieved an AUC of 0.998, demonstrating near-perfect ability to distinguish Sybil wallets from real users.

Closing Thoughts

Blending blockchain intuition with behavioral modeling proved powerful. The ensemble was able to capture subtle but consistent patterns in wallet activity that traditional heuristics might miss.

Sybil Detection Challenge

Hello folks! Here are some details about the approach I took on the Sybil Detection Challenge.
It was a great contest and made me explore new techniques I wasn’t familiar with (e.g: node2vec).

I’ll be sharing the path instead of only the “final” setup as I think it’s interesting in this case.
Before jumping into that, here are some of the most interesting reads I’ve found while researching for this competition.

Resources

Improving the Baseline

My first step was to get a baseline model going.
It’s really useful to have an end to end thing working as soon as possible to make iterations easier and faster.
I then submitted a couple of dummy predictions (all preds to 0.5) just to make sure they worked.
Once the model was setup, I added some stats from the provided datasets.
Simple aggregations like number of transactions, total value in/out, …

Adding these features and changing the model to a Random Forest started producing ROC AUC scores around 0.98 on a 5 fold local cross validation.
This is interesting as it means the models are able to learn really well the training dataset.
The results on the leaderboard data where different, though (ROC AUC around 0.8).
That meant the test set has a different distribution of sybil wallets (training is around 3% while test might be around 10%).

While exploring the test set to verify that hunch, I continued adding all the features I could think of from the provided datasets.

Adding Features

I added more aggregations at different levels (transaction, network, …), and also for the different wallets (from will produce aggregations for senders and to for receivers). Doing this over most of the column resulted in around 300 features. Some of the most interesting ones:

  • When that wallet received its first and last transactions
  • Number of unique tokens it used
  • How many wallets have interacted with it and with how many wallet has it interacted
  • Data about the address it funded the wallet (label, value of first tx, id, …). The goal was to get information about the founding event.

Adding these features raised the ROC AUC score to 0.9904, which confirmed the initial suspition that the training data didn’t contain much more predictive power.

Nonetheless, I continue adding more features. Many of them were hand crafted features based on intuition (ratios, activity metrics, …) but also spent a large amount of time trying to derive useful features from the interaction graph that these wallet form.

Once I constructed the graph, I was able to extract many interesting new features:

  • Classic graph metrics like Degree, PakeRank, Centrality, Cluster, …
  • Levain community clusters and it’s population size
  • An embedding (64 values) of each wallet in the graph

This turned out to be important. Specially since it gives use more values we can “Target Encode”.

The important one is the “community” and the “degree”. Basically, you’re giving the model the average sybil-ness of it’s communiy and of people that interacted with the same number of wallets.

This moved the score to 0.9984.

I also explored these services’ APIs with the hope of augmenting the training data. No luck there as they were very limited in the number of requests I could make.

Finally, I found a bunch of CSVs online that contained a list of sybil wallets. I joined them to our training data and computed the average sybil-ness of the “potentially sybil” wallets. Since these weren’t 100% hits (the online lists have many false positives), I created a new feature that was basically if I did find wallet X on any of the lists.

Cleaning Data

Another thing I realized while testing out thigns is that the training data contained some addresses that were contracts.

And seems the same thing happens within the test dataset.

That means that, if we get a list of contracts, we can mark these as “non sybil”. It’ll make our training dataset cleaner and ensure some accurate predictions in tests.

I got the data from a couple of Dune/Flipside queries, joined it, and did some lightweight preprocessing.

Also added some extra cleaning steps like removing columns with only null values and removing features with low variance.

Later one, I updated some of the training labels based on Ben2k’s false negatives and an Out Of Fold prediction with the best model.

Improving the Models

Since the start, I knew my model had a few problems:

  • Too many columns.
  • I was discarding the Datetime Features.
  • Categorical features where being label encoded instead of OOE.

I spent some time improving the model pipeline to fix that. Added a feature selection step to keep only 100 features based on importance. Processed properly the datetime and categorical features. And, finally, moved to LGBM instead of Random Forests.

With all of this the local ROC AUC was of 0.99912. I’m sure there are a few more things that can be done (like subsampling with NearMiss or similar) but didn’t want to spend more time trying out things if the score was already that close to 1.

How could I check the generalizability of this approach? Well, there is another Pond competition that has the same exact training data.

I downloaded it, changed the read_csv() to these files, run the model with predict instead of predict_proba and sent the submission (I had sent a dummy one earlier to check the shape). The result was promising.

Since the evaluation windows were taking their time. I started working on a different approach and changed the shape of the problem to “Event Sequence Classification”. That means I would build a model that took a bunch of events (sent tx, swapped, sent another tx) and categorize them as sybil or not. Unfortunately, I wasn’t able to finish with this approach as I got stuck figuring out the metadata and how to make a model understand that. This is something companies do a lot, for example, when predicting a trial conversion or if an user will churn.

Postprocessing

The postprocessing step was simple yet effective. Since I have a list of labeled contracts from Flipside, I mark them as non-sybil directly!

The last submission I sent is also an ensemble of the previous 20 ones (where I tried all sorts of feature combinations and models)!

Conclussion

Excited to know the final results as the final dataset is quite different from the initial one. I’m sure there are a bunch of things I missed as I didn’t explore this second training dataset as much as I would have liked.

Another side-learning is that these kind of competitions will train models that overfit the sybil types that have been seen in the training data. If a sybil wallet is on the training data but not marked as sybil, the model will learn to classify it as non-sybil and the model will be trained to ignore these. Here, more than in other type of competitions, is where we need a plurality of models, each trained with different approaches and datasets.

2 Likes

Data Collection and Preprocessing

  • Data from both chains (base and ethereum) were concatenated to create unified training and test datasets.

  • Merging operations were conducted using:

    • Address (ADDRESS) ↔ FROM_ADDRESS from transactions
    • Address ↔ ORIGIN_FROM_ADDRESS from token transfers

Cleaning

  • Duplicates were removed using drop_duplicates(subset='ADDRESS') to ensure address uniqueness.
  • Only the first occurrence of each address in merged datasets was retained.

Feature Engineering

A custom function build_merged_df was used to engineer rich, temporal and transaction-based features from DEX swaps. Features included:

  • Number of unique active days
  • Time since last transaction
  • Transaction period span in days
  • Total USD value of tokens swapped in/out
  • Average time between swaps
  • Count of transactions per address

Additional features from transaction and token transfer data were included:

  • Gas fees, values, function signatures, raw token metrics
  • Categorical fields like SYMBOL, PLATFORM, TOKEN_OUT

Modeling Strategy

A Stratified K-Fold Cross-Validation (5 folds) strategy was adopted to ensure class distribution balance across splits.

Models Used

Four machine learning models were trained and evaluated using ROC AUC score as the performance metric:

  1. LightGBM
  2. XGBoost
  3. CatBoost
  4. HistGradientBoostingClassifier

Test Data Preprocessing

Test data underwent the same transformation and feature engineering pipeline as the training data.

Prediction

  • Predictions were obtained from each model.
  • A simple average ensemble of the 4 model probabilities was used to generate the final prediction.

ensemble_preds = (xgb_preds + lgb_preds + cat_preds + hgb_preds) / 4

Sybil Detection Model Writeup: Crypto Pond Competition

Participant: Stakeridoo
Email : stakeridoo@gmail.com
Telegram: Telegram: Contact @ben2k_stakeridoo
X: http://x.com/stakeridoo

Introduction

Imagine a protocol airdrop worth millions — and a single actor claiming thousands of shares using fake identities. Sybil attacks aren’t just theoretical risks; they’re recurring exploit vectors that undermine decentralization at its core. Detecting them is not simply a machine learning problem, but a matter of understanding coordination, incentives, and deception on-chain. This project reflects years of real-world analysis distilled into a practical pipeline that blends custom SQL heuristics, feature-rich LightGBM modeling, and a careful balance between precision and recall.


Data Sources & Preprocessing


Full SQL Dune Dashboard available here

The project leveraged both the Base and Ethereum datasets provided by the organizers, and three tailored SQL-derived sources:

  • Funding Analysis: Focused on initial ETH inflows to establish early clustering and synchronized deployment patterns.
  • Wallet Lifecycle Profiling: Captured first/last transaction times and frequency patterns, chain-specific.
  • Peer Transfers Only: Constructed exclusively from transfers between known addresses, deliberately excluding unfiltered inflow/outflow events, particularly CEX deposit sources.

These SQL-enhanced datasets were designed to reduce false positives by eliminating noise from non-relevant financial activity, such as CEX inflows, and to sharpen contrast between normal users and orchestrated Sybil structures. For example, the exclusion of non-peer transfers prevented misleading clusters that would have otherwise diluted detection accuracy.

The training dataset contains relatively few Sybils overall compared to the full dataset:

  • Ethereum train set: 52,501 addresses total, of which ~8,827 are labeled as Sybil (~16.8%)
  • Base train set: 51,515 addresses total, of which ~9,936 are labeled as Sybil (~19.3%)

This imbalance makes precision and effective false positive suppression all the more critical.
I identified mislabeling and omissions in the training set, particularly false negatives. My analysis flagged several wallet groups with clustered activity, suspicious funding synchronicity, and shared behavioral patterns that were previously unmarked. This includes influential edge cases where legitimate users might be misclassified as Sybil due to atypical activity.

Validation Case: The test set includes the wallet 0xf4b0556b9b6f53e00a1fdd2b0478ce841991d8fa , also known as Olimpio, a widely known airdrop influencer. While highly active and cross-chain present, Olimpio is not a Sybil — yet provides a perfect benchmark to test for false positives in high-activity profiles. The model correctly refrains from mislabeling it, validating its robustness. Such edge cases serve as a qualitative validation layer and help benchmark the model’s caution versus aggressiveness.


Methodological Philosophy

Sybil detection is a precision craft. Off-the-shelf classification techniques often conflate noise with signal, especially when features are shallow and uncurated. My philosophy leans on three core principles:

  • Prune noise at the source: Using SQL logic to remove irrelevant CEX flows and reduce cluster confusion.
  • Feature quality > quantity: Focused, behaviorally-motivated attributes rooted in years of observing Sybil farming, airdrop abuse, and DAO manipulation.
  • Validate with edge cases: Ensure real users like Olimpio or campaign overperformers are not falsely flagged.

The internals of my feature design are intentionally undisclosed. What I will share is that they emerged not from this competition but from a longer journey across DAO governance, bridge farming, identity farming, and network graph studies. This “secret sauce” is built on real airdrop abuse detection, not theory. You might think of it as a forensic lens — tuned to subtle patterns invisible to raw metrics — leveraging timing asymmetries, funding fingerprints, and chain-specific behavioral nuances. Certain constructions — involving funding path alignment, coordination delay gaps, and meta-cluster overlap — are subtle, proprietary, and designed to remain stealthy.


Pipeline Overview

  1. Load Data: Dexswaps from Base & Ethereum + SQL-enhanced auxiliary features
  2. False Negative Identification: Heuristic sweep on train using historical clustering
  3. Label Normalization: Consolidation, deduplication, cast
  4. Feature Generation (internally structured into subscores):
  • Activity-based
  • Funding pattern-based
  • Network proximity (via direct ERC20/native transfers)
  • DEX usage patterns (from Parquet)
  1. Graph Metrics: pagerank , betweenness , clustering , two_hop_pathing
  2. Modeling: LightGBM with scale_pos_weight, tuned via CV
  3. Calibration: Isotonic regression with CalibratedClassifierCV
  4. Submission Creation: Decision threshold = 0.5

Feature Importance

Top-ranked features included transaction frequency, funding source diversity, and graph proximity metrics. These were consistently impactful in identifying both subtle and overt Sybil coordination patterns, especially when combined into normalized subscore groups.

My model’s core relies on several composite scores. Each one synthesizes multiple dimensions:

  • Lifecycle depth (age, frequency, recency)
  • Funding velocity (delay after creation, shared funding roots)
  • Graph centrality (proximity to known Sybils or funding centers)
  • Economic behavior (DEX usage under real vs. fake cost profiles)

These features outperform simpler metrics like raw count or USD transferred. Graph augmentation was particularly effective in filtering ambiguous addresses otherwise passed as clean.


Score Distribution

  • Mean Score: ~0.30
  • Median Score: ~0.15
  • ≥0.50 (classified Sybil): 4,776 / 20,369
  • ≥0.95 (high confidence Sybil): 2,895

A strong bimodal distribution reflects the model’s ability to separate clearly between likely Sybils and legitimate users — reducing edge-case ambiguity and streamlining operational decisions.


Evaluation Metrics

Metric Training Validation
AUC 0.9564 0.9528
Binary Logloss 0.1963
Early Stopping Round 96 96
F1 Score 0.7181
Number of False Positives 3,086 31
Number of False Negatives 272 210
False Positive Rate 4.00% 0.16%
False Negative Rate 13.15% 40.62%
True Positive Rate 86.85% 59.38%
True Negative Rate 96.00% 99.84%

The relatively high false negative rate in the validation split is attributable to the fragmentation of large Sybil clusters across train/test boundaries. When a cluster is broken up during an 80/20 split, the model may see only partial context during training, resulting in some Sybil addresses being missed. This is amplified by the low overall share of labeled Sybils in the dataset.

Despite this, the model preserves an extremely low false positive rate, underscoring its cautious design philosophy — a critical trait for any system meant for production deployment where trust and integrity matter.


Final Remarks

This submission is the result of a matured process — not just a machine learning pipeline, but an investigative framework built to scale with adversarial creativity. Rather than chasing leaderboard overfitting, I’ve focused on building a system grounded in clarity, robustness, and practical deployment.

I approached this challenge not merely as a data scientist but as a long-time Sybil hunter. Over the past years, I’ve tracked airdrop farming rings, studied bridge farming, and analyzed DAO governance patterns. The patterns I learned — and the SQL queries and Python scripts honed along the way — formed the backbone of this approach.

My tooling isn’t generic, nor is it meant to be. It’s the result of deeply specialized use cases, gradually refined to minimize false positives without losing structural insight into deceptive coordination. From carefully filtered ERC20 funding paths to DEX-based behavioral modeling, each layer builds on a forensic mindset.

The edge is not in the code, but in knowing where to look.

This is why the inclusion of Olimpio in the test set felt so serendipitous. As a high-profile, high-activity participant with atypical transaction behavior, his address served as the ultimate test case for avoiding false positives. That the model passed this benchmark underscores the importance of building with domain awareness — not just statistical rigor.

What you see in the CSV is the output of a trained classifier. But what it represents is more than code — it’s a philosophy of defensive design. My system was built for clarity under pressure, not leaderboard aesthetics. That’s the distinction that enables real-world deployment. The maxim holds true here: you don’t win against Sybils by being fast — you win by being right. This is what separates a good model from a truly deployable one.

pond username: Oleh RCL
model: located on pond competition submission or can be provided as github repository.

date: 30/05/2025

Sybil Detection: Securing Web3 with Machine Learning and Blockchain Insights

The Challenge of Sybil Attacks

In the decentralized world of Web3, Sybil attacks—where bad actors spin up multiple fake identities—are a real headache. They can skew airdrops, hijack governance votes, or drain funding systems meant to support genuine users. The mission here was to build a model that sniffs out these sneaky Sybil wallets using historical blockchain data. My solution blends advanced feature engineering, graph analysis, and a beefy ensemble of machine learning models to tackle this head-on.

The Data Playground

I worked with a rich mix of datasets to fuel this model. We’ve got labeled data with around 2,500 known Sybil addresses, pulled from heavy hitters like Gitcoin Passport, LayerZero, zkSync, OP, Octant, and Gitcoin’s own ban lists. Then there’s the raw blockchain action: transaction records, token transfers, and DEX swaps from both the Base and Ethereum chains. To spice things up, I folded in a list of potential false negatives from Ben2k—think of it as a cheat sheet to catch Sybils that might’ve slipped through the cracks.

Crafting Features: The Heart of Detection

Good features are the secret sauce of any machine learning model, and I went all out here. I engineered a slew of features to capture the quirks of Sybil behavior, optimized for speed and insight. Here’s the rundown:

1. Temporal Features

  • Time since first transaction: How long has this wallet been around?
  • Mean and variance of transaction gaps: Are transactions steady or all over the place?
  • Transactions per hour: How busy is this address?
  • Burstiness: Spotting sudden flurries of activity—Sybils often can’t help but overdo it.

2. Transaction Counts

  • Total transactions: Raw activity level.
  • Transactions per day: Normalized to see if it’s a slow burner or a hyperactive spammer.

3. Velocity

  • Transaction velocity: Total value moved divided by active days. Fast movers might be up to no good.

4. Token Diversity

  • Unique tokens: Is this wallet a jack-of-all-trades or obsessed with one token? Sybils often farm specific airdrops.

5. Chain Preferences

  • Base transaction ratio: How much action happens on Base versus Ethereum? Sybils might lean one way.

6. Swap Behavior

  • Ethereum-to-Base swap ratio: For DEX users, this tracks cross-chain habits—Sybils might show odd patterns.

7. Entropy

  • Token entropy: How random are the tokens they touch?
  • Hourly entropy: Are transactions timed like clockwork or chaotic? Bots tend to be predictable.

8. Graph Features

I built a transaction graph with NetworkX and pulled out:

  • In-degree/out-degree: How many wallets send to or receive from this one?
  • Clustering coefficient: Are its buddies tightly knit?
  • PageRank: Is it a big player in the network?
  • Community detection: Using the Louvain method to spot cliques.
  • Sybil connections: What’s the share of transactions to/from known Sybils?

9. Graph Neural Networks (GNNs)

I trained a GraphSAGE model to churn out embeddings from the transaction graph. These capture deep network patterns—like a Sybil’s social circle—that raw stats might miss.

10. Anomaly Scores

An Isolation Forest flagged outliers based on all these features. Weirdos often turn out to be Sybils.

To keep things snappy, I cached these features and ran computations in parallel. No point in reinventing the wheel every time!

Building the Model

Prepping the Data

  • Outlier Cleanup: An Isolation Forest trimmed extreme cases from the training set—less noise, better signal.
  • Scaling: A RobustScaler tamed wild feature values.
  • Balancing Act: Sybils are rare, so I used ADASYN to boost their numbers in training, leveling the playing field.

The Ensemble Dream Team

I went with a stacking classifier—a powerhouse combo of:

  • RandomForestClassifier: Great for spotting patterns.
  • XGBoost, LightGBM, CatBoost: Gradient boosting champs with different flavors.
  • TabNet: A deep learning twist for tabular data.

These base models feed into a final XGBoost layer that ties it all together. I tuned their hyperparameters with Optuna, chasing the best ROC AUC score, and used a stratified split to keep things fair.

Calibration

Raw probabilities can be wonky, so I ran CalibratedClassifierCV with isotonic regression to make them trustworthy—crucial for ranking Sybils accurately.

Results That Speak

I tested the model with ROC AUC and precision-recall curves, tweaking an optimal threshold for classification (though the competition wants probabilities). The final predictions landed in a submission file, and I even plotted a histogram of scores to see how they spread out—complete with that sweet optimal threshold line.

Why It Matters

This model isn’t just a competition entry—it’s a step toward a safer Web3. By mixing blockchain smarts with machine learning muscle, it catches Sybils in the act, protecting decentralized systems from manipulation. From graph embeddings to ensemble magic, every piece is designed to outsmart the fakers.

Thanks for reading! This was a blast to build, and I hope it helps keep the Web3 community thriving.

Sybil Detection with Human Passport and Octant Share (Writeup)

The philosophy behind my method is to keep model simple and take advantage of data source diversity.

Data cleaning

  • Keep one column from strong correlated column set on numerical columns.

  • Remove categorical column with unique element or unique elements equal to data size.

Data aggregation

  • Aggregate numerical column following stats: mean, standard deviation, median, min, max, sum

  • categorical columns: compte unique value

  • add count per aggregate

Transaction data

Aggregate transaction data by FROM_ADDRESS, TO_ADDRESS separately.

Transfer data

Aggregate transfer data by TO_ADDRESS, FROM_ADDRESS, ORIGIN_FROM_ADDRESS, ORIGIN_TO_ADDRESS separately.

Swap data

Aggregate swap data by ORIGIN_FROM_ADDRESS, ORIGIN_TO_ADDRESS separately.

Model construction

  • Ensemble of model relying on xgboost and catboost using 5-kfold cross validation

  • Construct one ensemble per combined network (ethereum/base), data type (transaction/transfer/swap), and aggregated column (TO_ADDRESS/FROM_ADDRESS/ORIGIN_FROM_ADDRESS/ORIGIN_TO_ADDRESS according to data type) following diversity philosophy.

This lead to cross to cross validation with respect of AUC at least equal to 0.94 across ensembles with common value above 0.966

Prediction

  • Get prediction of each ensemble

  • Take mean across available ensembles

  • Take missing prediction with train LABEL mean computed over base and ethereum network

This lead to average cross validation of 0.9664520149563772

Submission

Use available test size of 20369.

:trophy: Sybil Detection

:chart_increasing: 1. Challenge Overview

  • Objective: Identify Sybil wallets on EVM-compatible chains.

:broom: 2. Data Preparation

  • Combined Ethereum and Base datasets (transactions, token transfers, swaps).

  • Standard cleaning: lowercased addresses, datetime parsing, numeric casting.

  • Introduced a global REFERENCE_TIME (May 31, 2025) to ensure time-consistent feature calculation.

:hammer_and_wrench: 3. Feature Engineering

  • Aggregated features at wallet level across:

    • Transactions: value, gas, counterparties, nonce, timing, activity patterns.

    • Token Transfers: cleaned token symbols (via get_cleaned_symbol), amount, diversity, entropy.

    • DEX Swaps: events, platform diversity, amounts, inter-swap timing.

  • Added temporal features (e.g. durations/recency from first/last timestamps to REFERENCE_TIME).

:wrench: 4. Modeling Approach

  • Models: LightGBM (gbdt) and CatBoost blend

  • Seed Averaging: Used 3 random seeds for each model.

Sybil Detection with Human Passport and Octant

Participant: Mahboob Biswas :blush:

Email : mahboobbiswas@gmail.com

Contact : +91 7029232633

Date: May 31, 2025

:white_check_mark: Write-Up for is for my 11th submission.

End-to-End Write-Up for Sybil Detection Model

This write-up provides a comprehensive overview of the Sybil detection model implemented in the sybil_detection.ipynb Jupyter Notebook. The model aims to predict the probability (0 to 1) of a wallet address being a Sybil (a malicious entity creating multiple identities to manipulate a system) using blockchain data. The notebook employs advanced machine learning techniques, including feature engineering, class imbalance handling, hyperparameter tuning, ensemble methods, and model evaluation, to achieve high performance. Below, we explain each cell, its purpose, expected functionality, and anticipated outputs before building the model, followed by a summary of the overall pipeline.


Overview

Purpose: This cell provides an introduction to the Sybil detection model, outlining its objectives and the improvements implemented to enhance performance.

Content:

  • Objective: Predict the probability that a wallet address is a Sybil using historical blockchain data.
  • Improvements:
    • Address overfitting with cross-validation and regularization.
    • Enhance model performance through hyperparameter tuning, SMOTE for class imbalance, and threshold optimization.
    • Analyze feature importance using model-based metrics and SHAP (SHapley Additive exPlanations).
    • Implement ensemble methods like stacking, LightGBM, CatBoost, and voting classifiers.

Import Libraries

Purpose: Import all necessary Python libraries for data processing, modeling, evaluation, and visualization.

Functionality:

  • Imports libraries for:
    • Data manipulation (numpy, pandas).
    • Machine learning models (xgboost, lightgbm, catboost, sklearn).
    • Visualization (matplotlib, seaborn).
    • Model evaluation (sklearn.metrics).
    • Class imbalance handling (imblearn.over_sampling.SMOTE).
    • Feature importance analysis (shap).
    • Utility (os, joblib, datetime, warnings).
  • Suppresses warnings to keep the output clean.

Data Loading

Purpose: Load the blockchain datasets (transactions, DEX swaps, token transfers, and address labels) for training and testing.

Functionality:

  • Sets a random seed (np.random.seed(42)) for reproducibility.
  • Loads parquet files containing:
    • Test addresses: test_addresses_base, test_addresses_ethereum (contain ADDRESS column).
    • Train addresses: train_addresses_base, train_addresses_ethereum (contain ADDRESS and LABEL columns).
    • Feature datasets: Transactions, DEX swaps, and token transfers for both Base and Ethereum chains.
  • Prints dataset information and column names to inspect the data structure.

Feature Engineering

Purpose: Create features from raw blockchain data to capture patterns indicative of Sybil behavior.

Functionality:

  • Defines a function prepare_data_for_xgboost that:
    • Creates a feature DataFrame indexed by unique addresses.
    • Generates transaction-based features: Count, mean value, mean fee, and standard deviation of gas used.
    • Generates temporal features: Transaction time span (days) and frequency per day.
    • Generates swap-based features: Count, mean amounts in/out, and standard deviation of amount out.
    • Generates interaction features: Ratios of fees to value and swap in/out amounts.
    • Generates transfer-based features: Count and mean amount.
    • Creates categorical features: One-hot encoded counts of swap platforms (e.g., platform_uniswap-v3).
    • Fills missing values with medians and scales numerical features using StandardScaler.
    • For training, joins labels and returns features (X) and labels (y); for testing, returns features only.
  • Combines Base and Ethereum data for training and test sets.
  • Saves the feature order to xgb_feature_order.csv for consistency.

Handle Class Imbalance with SMOTE

Purpose: Address the imbalance between Sybil (minority) and non-Sybil (majority) classes using SMOTE (Synthetic Minority Oversampling Technique).

Functionality:

  • Converts labels to numeric, fills NaN with 0, and binarizes them (threshold 0.5).

  • Applies SMOTE to oversample the minority class (Sybil, label 1) to match the majority class (non-Sybil, label 0).

  • Prints the original and resampled class distributions.

    Original class distribution: {0: 99985, 1: 4031}
    Resampled class distribution: {0: 99985, 1: 99985}

Train-Test Split

Purpose: Split the resampled data into training and validation sets.

Functionality:

  • Splits X_resampled and y_resampled into 80% training (X_train, y_train) and 20% validation (X_val, y_val) sets.

  • Uses stratify=y_resampled to maintain class balance.

  • Prints the shapes of the resulting datasets.

  • Example output:

    Training set shape: (159976, 44), Validation set shape: (39994, 44)
    
  • The data is split, with balanced classes in both sets, ready for model training.


Hyperparameter Tuning for XGBoost

Purpose: Optimize the XGBoost model’s hyperparameters using grid search to maximize ROC-AUC.

Code Cell:

xgb_param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'lambda': [1, 2],
    'alpha': [0, 1]
}

Functionality:

  • Defines a hyperparameter grid for XGBoost, testing various combinations of depth, estimators, learning rate, subsampling, and regularization parameters.
  • Initializes an XGBClassifier with class weights to handle any residual imbalance.
  • Performs 3-fold cross-validated grid search to maximize ROC-AUC.
  • Fits the model on X_train, y_train.
  • Prints the best parameters and ROC-AUC score.
  • Example output:
    Best XGBoost Parameters: {'alpha': 0, 'colsample_bytree': 1.0, 'lambda': 1, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}
    Best XGBoost ROC-AUC: 0.9992139265200622
    
  • best_xgb contains the optimized XGBoost model, ready for ensemble methods.

Train Other Models

Purpose: Train additional models (Random Forest, LightGBM, CatBoost) with basic hyperparameters for ensemble methods.

Code Cell:

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
lgb_model = lgb.LGBMClassifier(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.1,
    is_unbalance=True,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X_train, y_train)
cb_model = cb.CatBoostClassifier(
    iterations=200,
    depth=7,
    learning_rate=0.1,
    auto_class_weights='Balanced',
    verbose=0,
    random_state=42
)

Functionality:

  • Trains a Random Forest with 200 trees, max depth 10, and balanced class weights.
  • Trains a LightGBM model with 200 estimators, max depth 7, and imbalance handling.
  • Trains a CatBoost model with 200 iterations, depth 7, and balanced class weights.
  • Fits all models on X_train, y_train.

Ensemble Methods

Purpose: Implement stacking and voting classifiers to combine predictions from XGBoost, Random Forest, LightGBM, and CatBoost.

Code Cell:

stacking_model = StackingClassifier(
    estimators=[
        ('xgb', best_xgb),
        ('rf', rf_model),
        ('lgb', lgb_model),
        ('cb', cb_model)
    ],
    final_estimator=LogisticRegression(),
    cv=3
)
stacking_model.fit(X_train, y_train)
voting_model = VotingClassifier(
    estimators=[
        ('xgb', best_xgb),
        ('rf', rf_model),
        ('lgb', lgb_model),
        ('cb', cb_model)
    ],
    voting='soft'
)

Functionality:

  • Stacking Classifier:
    • Combines predictions from XGBoost, Random Forest, LightGBM, and CatBoost using 3-fold cross-validation.
    • Uses Logistic Regression as the final estimator to make the final prediction.
  • Voting Classifier:
    • Combines predictions using soft voting (averaging probabilities).
  • Fits both models on X_train, y_train.

Model Evaluation

Purpose: Evaluate the stacking model on the validation set, optimize the classification threshold, and report performance metrics.

Functionality:

  • Computes predicted probabilities for the validation set using the stacking model.
  • Calculates precision, recall, and F1-scores for various thresholds.
  • Selects the threshold that maximizes the F1-score.
  • Generates binary predictions using the best threshold.
  • Prints the classification report (precision, recall, F1-score per class) and ROC-AUC score.

Output:

  • Best Threshold: A value, e.g., 0.5000, optimized for F1-score.
  • Classification Report: Metrics for classes 0 and 1, e.g.:
    precision    recall  f1-score   support
    0       0.99      0.99      0.99     19997
    1       0.99      0.99      0.99     19997
    accuracy                        0.99     39994
    macro avg   0.99      0.99      0.99     39994
    weighted avg   0.99      0.99      0.99     39994
    
  • Validation ROC-AUC: A high score, e.g., 0.9994, indicating excellent discriminative ability.
  • Example output:
    Best threshold: 0.5000
                  precision    recall  f1-score   support
             0       0.99      0.99      0.99     19997
             1       0.99      0.99      0.99     19997
        accuracy                           0.99     39994
       macro avg       0.99      0.99      0.99     39994
    weighted avg       0.99      0.99      0.99     39994
    Validation ROC-AUC: 0.9994
    
  • The model’s performance is evaluated, and the optimal threshold is identified.

Feature Importance Analysis

Purpose: Analyze feature importance using XGBoost’s built-in importance scores and SHAP values.

Functionality:

  • Creates a DataFrame of feature importances from the XGBoost model.

  • Plots a bar chart of the top 20 features using Seaborn.

  • Prints the top 20 features with their importance scores.

  • Uses SHAP to compute and visualize feature contributions for the validation set.

  • Output: A DataFrame listing the top 20 features, we are including only 10 e.g.:

    Top 20 Features:
                feature            importance
    0  tx_time_span_days             0.1500
    1  platform_hashflow-v3          0.1200
    2  tx_count                      0.1000
    3  tx_gas_used_std               0.068686
    4  transfer_count                0.032816
    5  tx_freq_per_day               0.019382
    6  tx_gas_used_std               0.068686
    7  swap_amount_out_usd_mean      0.011656
    8  transfer_amount_usd_mean      0.029811
    9  tx_value_usd_mean             0.032430
    10 tx_fee_mean                   0.054944
    ... 
    
  • SHAP Plot: A bar plot showing SHAP values for the top 20 features, highlighting their impact on predictions.

  • The analysis identifies key features driving Sybil detection, such as temporal and platform-specific behaviors.


Cross-Validation

Purpose: Perform 5-fold cross-validation to assess the stacking model’s robustness.
Code Cell:

Functionality:

  • Performs 5-fold stratified cross-validation on the resampled data.

  • Trains the stacking model on each fold and computes ROC-AUC on the validation fold.

  • Prints the ROC-AUC scores for each fold, along with the mean and standard deviation.

  • Output:

    Cross-Validation ROC-AUC Scores: [0.9989968553291696, 0.9990570846466095, 0.9983119511233182, 0.9990006602205515, 0.9990545813956908]
    Mean CV ROC-AUC: 0.9989, Std: 0.0003
    
  • The cross-validation confirms the model’s robustness across different data splits.


Generate Test Predictions

Purpose: Generate predictions for the test set and create a submission file.

Functionality:

  • Defines a function prepare_test_submission that:

    • Combines test addresses from Base and Ethereum datasets, ensuring uniqueness.
    • Aligns test features with the training feature order (xgb_feature_order.csv).
    • Removes duplicate indices and reindexes to match test addresses.
    • Fills missing values with medians.
    • Predicts probabilities and binary classes using the stacking model and optimized threshold.
    • Creates a submission DataFrame with ADDRESS and PRED (probabilities).
    • Saves the submission to submission_improved.csv.
  • Calls the function and prints the first five rows of the submission.

  • A file submission_improved.csv is created with unique 20,369 rows, containing addresses and their predicted Sybil probabilities.


Conclusion

Purpose: Summarize the model’s performance and suggest future enhancements.

Content:

  • Reports a validation ROC-AUC of 0.9994 (XGBoost) and F1-score of 0.9940 (stacking model).
  • Highlights key factors: SMOTE, robust feature engineering (temporal and platform features), ensemble methods, and overfitting prevention.
  • Notes important features: tx_time_span_days, platform_hashflow-v3.

Next write up :next_track_button:

:white_check_mark: Write-Up for is for my 12th submission.

The sybil-v2.ipynb.zip Jupyter Notebook implements an enhanced machine learning pipeline for Sybil detection in a blockchain context, as part of a competition by Human Passport by Holonym, Octant, and the Ethereum Foundation. The goal is to predict the probability (0 to 1) of a wallet address being a Sybil (fake identity) using historical blockchain data. This write-up provides a comprehensive overview of the model, explaining each cell, outputs, techniques used, and performance metrics, aligning with the outcomes before building the model.


Overview of the Notebook and Objectives

The notebook builds on my previous Sybil detection model by incorporating graph-based features and dynamic thresholding to improve detection capabilities. Sybil detection in blockchain involves identifying fake wallet addresses used to manipulate systems (e.g., airdrops, voting). The model processes blockchain transaction data, extracts features, handles class imbalance, trains multiple models, and uses an ensemble approach to generate predictions.

Objectives

  • Predict the probability (0 to 1) of a wallet being a Sybil.
  • Enhance feature engineering with graph-based features to capture wallet relationships.
  • Implement dynamic thresholding to optimize for multiple metrics (F1-Score, recall, precision-recall AUC).
  • Achieve high ROC-AUC, F1-Score, and recall while minimizing overfitting.

Explanation and Outputs

Import Libraries

  • Purpose: Imports necessary libraries for data processing, modeling, evaluation, visualization, and graph analysis.
  • Key Libraries:
    • Data Processing: pandas, numpy.
    • Models: xgboost, lightgbm, catboost, sklearn (Random Forest, Logistic Regression).
    • Evaluation: sklearn.metrics (ROC-AUC, F1-Score, etc.).
    • Graph Analysis: networkx.
    • Visualization: matplotlib, seaborn, shap.
    • Imbalance Handling: imblearn (SMOTE).

Data Loading

  • Purpose: Loads blockchain datasets and preprocesses the LABEL column in training data to ensure binary labels (0 or 1).
  • Datasets:
    • test_addresses_base, test_addresses_ethereum: Test wallet addresses (ADDRESS).
    • train_addresses_base, train_addresses_ethereum: Training wallet addresses (ADDRESS, LABEL).
    • dex_swaps_base, dex_swaps_ethereum: Swap transactions.
    • token_transfers_base, token_transfers_ethereum: Token transfers.
    • transactions_base, transactions_ethereum: Transaction data.
  • Preprocessing:
    • Converts LABEL to numeric, coercing invalid entries to NaN.
    • Binarizes LABEL using a 0.5 threshold (≥ 0.5 → 1, else 0).
      • 20,369 test addresses, as expected for a competition dataset.

      • 51,515 training addresses, with LABEL initially as object (preprocessed to int).

    • DEX Swaps Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (25 columns).
    • Token Transfers Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (18 columns).
    • Transactions Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (30 columns).

Feature Engineering Overview

  • Purpose: Extracts features from blockchain data, including transaction-based, temporal, swap-based, transfer-based, categorical, and graph-based features.
  • Function: prepare_data_for_xgboost(data, is_train=True)
    • Inputs:
      • data: Dictionary with transactions, dex_swaps, token_transfers, train_addresses/test_addresses.
      • is_train: Boolean indicating training (returns features and labels) or testing (features only).
    • Steps:
      • Transaction Features: tx_count, tx_value_usd_mean, tx_fee_mean, tx_gas_used_std.
      • Temporal Features: tx_time_span_days, tx_freq_per_day.
      • Swap Features: swap_count, swap_amount_in_usd_mean, swap_amount_out_usd_mean, swap_amount_out_usd_std.
      • Interaction Features: tx_fee_to_value_ratio, swap_in_out_ratio.
      • Transfer Features: transfer_count, transfer_amount_usd_mean.
      • Categorical Features: One-hot encodes PLATFORM (e.g., platform_hashflow-v3).
      • Graph-Based Features (Optimized):
        • Samples 100,000 rows from transactions and transfers to reduce graph size.
        • Computes in_degree and out_degree directly from edge lists.
        • Builds a sampled graph and subgraph (top 10,000 nodes by degree).
        • Computes clustering_coefficient and betweenness_centrality on the subgraph.
      • Preprocessing: Fills missing values with medians, scales features with StandardScaler.
    • Output:
      • For training: Features X (DataFrame), labels y (Series).
      • For testing: Features X only.
  • Execution:
    • Processes training and test data, saving feature order to xgb_feature_order.csv.
    • Prints: Warning: Duplicate addresses found in addresses DataFrame. Dropping duplicates.
      • Indicates duplicates in train_addresses (expected due to concatenation), resolved by keeping the first occurrence.
  • Output:
    • Warning about duplicates, confirming data cleaning.
    • Features: ~44-48 columns (original + 4 graph-based: in_degree, out_degree, clustering_coefficient, betweenness_centrality).

Handle Class Imbalance with SMOTE

  • Purpose: Balances the dataset by oversampling the minority class (Sybil) using SMOTE.
  • Steps:
    • Ensures y is binary (0/1) by converting to numeric and binarizing.
    • Applies SMOTE to balance classes.
  • Output:
    • Original class distribution: {0: 96524, 1: 2543}
      Resampled class distribution: {0: 96524, 1: 96524}
      
    • Original: Highly imbalanced (97% non-Sybil, 3% Sybil), as expected for Sybil detection.
    • Resampled: Balanced, confirming SMOTE’s effectiveness.

Train-Test Split

  • Purpose: Splits the resampled data into training and validation sets.
  • Steps:
    • Uses train_test_split with 80:20 ratio, stratified on y_resampled.
  • Output:
    • Training set shape: (154438, 48), Validation set shape: (38610, 48)
      
    • Total samples: 154,438 + 38,610 = 193,048 (matches 96,524 * 2 from SMOTE).
    • Features: 48 columns, confirming addition of graph-based features.

Hyperparameter Tuning for XGBoost

  • Purpose: Tunes XGBoost hyperparameters using GridSearchCV.
  • Steps:
    • Defines a parameter grid (e.g., max_depth, n_estimators, learning_rate).
    • Uses 3-fold cross-validation with ROC-AUC scoring.
  • Output:
    • Best XGBoost Parameters: {'alpha': 0, 'colsample_bytree': 0.8, 'lambda': 1, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'subsample': 1.0}
      Best XGBoost ROC-AUC: 0.9992139265200622
      
    • Parameters: Balanced settings (max_depth=7, learning_rate=0.1), indicating a deep but regularized model.
    • ROC-AUC: 0.9992, very high, as expected due to strong features and balanced data.

Train Other Models

  • Purpose: Trains Random Forest, LightGBM, and CatBoost models with basic tuning.
  • Steps:
    • Random Forest: n_estimators=200, max_depth=10, class_weight='balanced'.
    • LightGBM: n_estimators=200, max_depth=7, is_unbalance=True, verbose=-1.
    • CatBoost: iterations=200, depth=7, auto_class_weights='Balanced'.

Ensemble Methods

  • Purpose: Implements Stacking and Voting ensembles using the trained models.
  • Steps:
    • Stacking: Combines XGBoost, Random Forest, LightGBM, and CatBoost with a Logistic Regression meta-model.
    • Voting: Uses soft voting across the same models.

Evaluate Models and Check for Overfitting

  • Purpose: Evaluates all models (XGBoost, Random Forest, LightGBM, CatBoost, Stacking, Voting) on training and validation sets.
  • Function: evaluate_model(model, X_train, y_train, X_val, y_val, model_name)
    • Computes ROC-AUC, F1-Score, recall, precision, and overfitting gap.
    • Plots confusion matrices.
  • Output:
    • XGBoost:
      Training ROC-AUC: 0.9997
      Validation ROC-AUC: 0.9993
      Validation F1-Score: 0.9909
      Validation Recall: 0.9950
      Validation Precision: 0.9869
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0004
      
    • Random Forest:
      Training ROC-AUC: 0.9948
      Validation ROC-AUC: 0.9941
      Validation F1-Score: 0.9700
      Validation Recall: 0.9800
      Validation Precision: 0.9603
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0007
      
    • LightGBM:
      Training ROC-AUC: 0.9996
      Validation ROC-AUC: 0.9991
      Validation F1-Score: 0.9901
      Validation Recall: 0.9949
      Validation Precision: 0.9854
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0005
      
    • CatBoost:
      Training ROC-AUC: 0.9987
      Validation ROC-AUC: 0.9980
      Validation F1-Score: 0.9874
      Validation Recall: 0.9933
      Validation Precision: 0.9815
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0007
      
    • Stacking:
      Training ROC-AUC: 0.9997
      Validation ROC-AUC: 0.9986
      Validation F1-Score: 0.9932
      Validation Recall: 0.9961
      Validation Precision: 0.9903
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0011
      
    • Voting:
      Training ROC-AUC: 0.9994
      Validation ROC-AUC: 0.9988
      Validation F1-Score: 0.9886
      Validation Recall: 0.9938
      Validation Precision: 0.9835
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0006
      
    • Confusion matrices plotted for each model.

Dynamic Threshold Optimization

  • Purpose: Optimizes the classification threshold for multiple metrics (F1-Score, recall, precision-recall AUC).
  • Function: optimize_threshold_dynamic(y_true, y_pred_proba, metrics=['f1', 'recall', 'pr_auc'])
    • Optimizes for:
      • f1: Maximizes F1-Score.
      • recall: Maximizes recall with precision ≥ 0.95.
      • pr_auc: Computes Precision-Recall AUC, uses F1-optimal threshold.
    • Plots F1-Score vs. threshold, Recall vs. threshold, and PR curve.
  • Output:
    • Threshold Optimization Results:
      F1: Best Threshold = 0.7751, Score = 0.9939
      RECALL: Best Threshold = 0.7751, Score = 0.9961
      PR_AUC: Best Threshold = 0.7751, Score = 0.9986
      
    • Plots: F1-Score, Recall, and PR curves, showing threshold selection.
    • Best threshold (F1): 0.7751, F1-Score 0.9939, close to original (0.9940).

Feature Importance Analysis

  • Purpose: Analyzes feature importance using XGBoost’s feature_importances_ and SHAP.
  • Output:
    • Top 20 Features:
                        feature  importance
      4          tx_time_span_days    0.643849
      3            tx_gas_used_std    0.068686
      2                tx_fee_mean    0.054944
      12            transfer_count    0.032816
      1          tx_value_usd_mean    0.032430
      13  transfer_amount_usd_mean    0.029811
      5            tx_freq_per_day    0.019382
      27      platform_hashflow-v3    0.017512
      0                   tx_count    0.016956
      6                 swap_count    0.013493
      17         platform_balancer    0.012196
      8   swap_amount_out_usd_mean    0.011656
      7    swap_amount_in_usd_mean    0.008792
      9    swap_amount_out_usd_std    0.007483
      26         platform_hashflow    0.007422
      43            platform_woofi    0.007217
      11         swap_in_out_ratio    0.007030
      22          platform_dexalot    0.004264
      20            platform_curve    0.004062
      31      platform_maverick-v2    0.000000
      
    • SHAP plot: Not shown in output but generated for visual analysis.

Cross-Validation

  • Purpose: Performs 5-fold cross-validation to ensure robust performance.
  • Steps:
    • Uses StratifiedKFold to maintain class balance.
    • Trains the Stacking model on each fold, computes ROC-AUC.
  • Output:
    • Cross-Validation ROC-AUC Scores: [0.999516648229417, 0.999653417270615, 0.9992871301288809, 0.999347305662059, 0.9993718291117792]
      Mean CV ROC-AUC: 0.9994, Std: 0.0001
      
    • Mean ROC-AUC 0.9994, Std 0.0001, indicating high performance and consistency.

Generate Test Predictions

  • Purpose: Generates predictions for test addresses using the Stacking model and optimized threshold.
  • Function: prepare_test_submission(model, test_features, test_addresses_list, threshold, output_file)
    • Aligns test features, predicts probabilities, applies threshold, saves submission.
    • Submission saved to ‘submission_improved_2.csv’
    • 20,369 unique test addresses

Techniques and Methods for High Accuracy

Data Preprocessing

  • Label Binarization: Ensured LABEL is binary (0/1), handling continuous or invalid values.
  • SMOTE: Balanced classes (from 96,524:2,543 to 96,524:96,524), mitigating bias toward non-Sybil.
  • Feature Scaling: Used StandardScaler to normalize features, improving model convergence.

Feature Engineering

  • Temporal Features: tx_time_span_days (0.6438 importance) captures Sybil behavior (short activity spans).
  • Graph-Based Features: Added in_degree, out_degree, clustering_coefficient, betweenness_centrality to identify network patterns, though their importance was lower than expected due to sampling.
  • Platform Features: platform_hashflow-v3 (0.0175) highlights platform-specific Sybil behavior.

Model Training

  • XGBoost: Achieved ROC-AUC 0.9993 with tuned parameters (max_depth=7, learning_rate=0.1).
  • Ensemble (Stacking): Combined models, achieving F1-Score 0.9932 and recall 0.9961, leveraging diverse strengths.

Dynamic Thresholding

  • Optimized for F1-Score (0.9939), recall (0.9961), and PR-AUC (0.9986), ensuring flexibility in metric prioritization.

Overfitting Handling

  • Small Gaps: Overfitting gap < 0.001 for most models (e.g., XGBoost 0.0004), indicating minimal overfitting.
  • Cross-Validation: Std 0.0001 confirms consistency.
  • Regularization: Used in XGBoost (lambda=1), Random Forest (max_depth=10), and other models.

Feature Importance

  • Top Features:
    • tx_time_span_days (0.6438): Dominant, as expected.
    • tx_gas_used_std (0.0687), tx_fee_mean (0.0549): Indicate irregular transaction patterns.
    • Graph Features: Not in top 20, possibly due to sampling; may improve with larger subgraphs.

Conclusion

The model achieves exceptional performance (ROC-AUC 0.9994, F1-Score 0.9939, recall 0.9961), meeting expectations for Sybil detection. Graph-based features and dynamic thresholding enhance detection, though further optimization (e.g., larger graph samples) could improve graph feature importance. The model is robust, consistent, and ready for submission to the competition leaderboard.