Write Up for Models Predicting Sybil Scores of Wallets

What a great thread this is. Thanks to all contributors. I don’t have a writeup, but I want to bring your attention to a new app in the bot-proofing and sybil-detection space: CUBID Protocol, which takes a lot of inspiration from the old Gitcoin Passport and introduces several key improvements:

  • email signup and login, as opposed to metamask
  • ability to add identity proofs from multiple chains
  • app-scoped identities, preventing user tracking across apps or networks
  • optional embedded app-scoped EVM account creation (abstracted wallets)
  • SDK with embedded UI componentry (no need to send users to external site for identity management)
  • user-controlled full or partial sharing of their identity, or only sharing the score
  • and more

Feel free to try it out. Embedding CUBID into your app takes just a few lines of code. Also, while it already works, we are still very much in development and looking for devs to help out.


Yes, it’s expected. The AUC score, as the term implies (Area Under the Curve), basically measures how well your model distinguishes between sybil and non-sybil, and for that you need a correctly labelled dataset. The AUC you got while training won’t be the same as on the test set because it’s a new set of addresses, and only the hosts of the competition currently have the fully labelled dataset against which to check how well your model predicts. So when they check it, they’ll let you know how well your model’s predictions on the test addresses correctly separated sybil from non-sybil.
Don’t know if that was clear enough.


Yeah, I totally understand you. I was actually referring to the fact that the train and test features have different distributions after data preparation.

Sybil Detection Model – Competition Writeup

Pond AI Platform name: Limonada

In this competition, I built a machine learning pipeline to detect Sybil wallets using behavioral features extracted from Ethereum and Base chain data.

Data Overview

  • Addresses: From Ethereum and Base chain data.

  • Transactions: Regular transactions, token transfers, and DEX swaps.

  • I merged and standardized data from both chains to enable consistent feature generation.

Feature Engineering

I created features through different groupings and aggregations over the transactional data. These behavioral metrics helped highlight subtle differences between Sybil and legitimate users.
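
To make this concrete, here is a minimal sketch of the kind of groupby aggregations involved (column names such as FROM_ADDRESS, VALUE_USD, and BLOCK_TIMESTAMP are assumptions, not necessarily the exact ones used):

import pandas as pd

# Hypothetical transactions table: one row per transaction (column names are assumptions).
tx = pd.read_parquet("transactions.parquet")

features = tx.groupby("FROM_ADDRESS").agg(
    tx_count=("TX_HASH", "count"),
    total_value_out=("VALUE_USD", "sum"),
    mean_value_out=("VALUE_USD", "mean"),
    first_tx=("BLOCK_TIMESTAMP", "min"),
    last_tx=("BLOCK_TIMESTAMP", "max"),
)
features["active_span_days"] = (features["last_tx"] - features["first_tx"]).dt.days + 1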

Handling Class Imbalance

Since the dataset was imbalanced, I experimented with several undersampling techniques to improve learning:

  • NearMiss (various versions)

  • TomekLinks

These approaches helped in exploring different class balancing strategies.
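
As an illustration of what these look like in code (a sketch using imbalanced-learn, not the exact configuration used here):

from imblearn.under_sampling import NearMiss, TomekLinks

# X, y: the feature matrix and binary Sybil labels (assumed to exist already).
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)  # keep majority samples closest to minority samples
X_tl, y_tl = TomekLinks().fit_resample(X, y)         # drop majority samples that form Tomek links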

Model & Training

To address the classification challenge, I focused on building a robust stacked ensemble model. My strategy was to combine several diverse base models and use a meta-model to learn how to best integrate their predictions.

Modeling Approach

Base Models Used:

  • Random Forest (with tuned hyperparameters)

  • XGBoost

  • LightGBM

  • Logistic Regression

  • SVC (with probability outputs)

  • Multi-Layer Perceptron (MLP)

Each base model was trained using 5-fold Stratified Cross-Validation, generating:

  • Out-of-fold (OOF) predictions for the training set

  • Averaged predictions for the test set
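
A minimal sketch of how such out-of-fold and test predictions can be produced (assuming NumPy arrays X, y, X_test; this is an illustration, not the exact code used):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def oof_predictions(model, X, y, X_test, n_splits=5, seed=42):
    """Out-of-fold predictions for the train set and fold-averaged predictions for the test set."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred

# Stacking the OOF columns of all base models gives the meta-model's training matrix.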

Stacking Ensemble:

I fed the OOF predictions from all base models into a second-layer model (meta-model). After experimenting with different options, I selected the best performers based on cross-validated AUC.

Meta-Models Used:

  • RidgeClassifierCV

  • XGBoost

Evaluation Metric:

  • ROC AUC was used throughout to measure performance and guide model selection.

Results

The final ensemble achieved an AUC of 0.998, demonstrating near-perfect ability to distinguish Sybil wallets from real users.

Closing Thoughts

Blending blockchain intuition with behavioral modeling proved powerful. The ensemble was able to capture subtle but consistent patterns in wallet activity that traditional heuristics might miss.

Sybil Detection Challenge

Hello folks! Here are some details about the approach I took on the Sybil Detection Challenge.
It was a great contest and made me explore new techniques I wasn’t familiar with (e.g., node2vec).

I’ll be sharing the path instead of only the “final” setup as I think it’s interesting in this case.
Before jumping into that, here are some of the most interesting reads I’ve found while researching for this competition.

Resources

Improving the Baseline

My first step was to get a baseline model going.
It’s really useful to have something working end to end as soon as possible, to make iterations easier and faster.
I then submitted a couple of dummy predictions (all preds set to 0.5) just to make sure submissions worked.
Once the model was set up, I added some stats from the provided datasets.
Simple aggregations like number of transactions, total value in/out, …

Adding these features and changing the model to a Random Forest started producing ROC AUC scores around 0.98 on a 5-fold local cross validation.
This is interesting, as it means the models are able to learn the training dataset really well.
The results on the leaderboard data were different, though (ROC AUC around 0.8).
That suggested the test set has a different distribution of sybil wallets (training is around 3% while test might be around 10%).

While exploring the test set to verify that hunch, I continued adding all the features I could think of from the provided datasets.

Adding Features

I added more aggregations at different levels (transaction, network, …), and also for the different wallets (grouping by from produces aggregations for senders and by to for receivers). Doing this over most of the columns resulted in around 300 features. Some of the most interesting ones:

  • When that wallet received its first and last transactions
  • Number of unique tokens it used
  • How many wallets have interacted with it, and how many wallets it has interacted with
  • Data about the address that funded the wallet (label, value of first tx, id, …). The goal was to get information about the funding event.

Adding these features raised the ROC AUC score to 0.9904, which confirmed the initial suspicion that the training data didn’t contain much more predictive power.

Nonetheless, I continued adding more features. Many of them were hand-crafted based on intuition (ratios, activity metrics, …), but I also spent a large amount of time trying to derive useful features from the interaction graph that these wallets form.

Once I constructed the graph, I was able to extract many interesting new features:

  • Classic graph metrics like degree, PageRank, centrality, clustering coefficient, …
  • Louvain community clusters and their population sizes
  • An embedding (64 values) of each wallet in the graph
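
A sketch of how such graph features can be derived with NetworkX and the node2vec package (the edge DataFrame and its column names are assumptions; Louvain communities require a recent NetworkX version):

import networkx as nx
from node2vec import Node2Vec

# edges: DataFrame of wallet-to-wallet transfers (FROM_ADDRESS / TO_ADDRESS are assumed column names).
G = nx.from_pandas_edgelist(edges, source="FROM_ADDRESS", target="TO_ADDRESS")

degree = dict(G.degree())
pagerank = nx.pagerank(G)
clustering = nx.clustering(G)

# Louvain communities: map each wallet to a community id and that community's size.
communities = nx.community.louvain_communities(G, seed=42)
community_id = {node: i for i, comm in enumerate(communities) for node in comm}
community_size = {node: len(communities[community_id[node]]) for node in G.nodes}

# 64-dimensional embeddings of each wallet in the graph (pip install node2vec).
embeddings = Node2Vec(G, dimensions=64, walk_length=30, num_walks=10, workers=4).fit(window=10).wv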

This turned out to be important, especially since it gives us more values we can target encode.

The important ones are the community and the degree. Basically, you’re giving the model the average sybil-ness of its community and of wallets that interacted with the same number of wallets.
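
In code, target encoding those graph-derived columns could look roughly like this (a sketch; in practice the encoding should be computed out of fold to avoid leaking the label):

# train/test: DataFrames with 'community' and 'degree' columns; 'LABEL' is the binary Sybil flag (assumed names).
for col in ["community", "degree"]:
    means = train.groupby(col)["LABEL"].mean()  # average sybil-ness per group
    train[f"{col}_te"] = train[col].map(means)
    test[f"{col}_te"] = test[col].map(means).fillna(train["LABEL"].mean())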

This moved the score to 0.9984.

I also explored these services’ APIs with the hope of augmenting the training data. No luck there as they were very limited in the number of requests I could make.

Finally, I found a bunch of CSVs online that contained lists of sybil wallets. I joined them to our training data and computed the average sybil-ness of the “potentially sybil” wallets. Since these weren’t 100% hits (the online lists have many false positives), I created a new feature that simply indicated whether wallet X appeared on any of the lists.

Cleaning Data

Another thing I realized while testing out things is that the training data contained some addresses that were contracts.

And it seems the same thing happens within the test dataset.

That means that, if we get a list of contracts, we can mark them as “non sybil”. It’ll make our training dataset cleaner and ensure some accurate predictions on the test set.

I got the data from a couple of Dune/Flipside queries, joined it, and did some lightweight preprocessing.

Also added some extra cleaning steps like removing columns with only null values and removing features with low variance.

Later on, I updated some of the training labels based on Ben2k’s false negatives and an out-of-fold prediction from the best model.

Improving the Models

Since the start, I knew my model had a few problems:

  • Too many columns.
  • I was discarding the datetime features.
  • Categorical features were being label encoded instead of one-hot encoded.

I spent some time improving the model pipeline to fix that: I added a feature selection step to keep only the 100 most important features, properly processed the datetime and categorical features, and, finally, moved to LightGBM instead of Random Forests.
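
An importance-based selection step like the one described can be sketched as follows (the estimator and the threshold handling here are assumptions):

import numpy as np
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

# Keep the 100 most important features according to a quick LightGBM fit (X, y assumed).
selector = SelectFromModel(
    lgb.LGBMClassifier(n_estimators=200, random_state=42),
    max_features=100,
    threshold=-np.inf,  # rank purely by importance instead of using the default mean threshold
)
X_selected = selector.fit_transform(X, y)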

With all of this, the local ROC AUC was 0.99912. I’m sure there are a few more things that could be done (like undersampling with NearMiss or similar), but I didn’t want to spend more time trying things out when the score was already that close to 1.

How could I check the generalizability of this approach? Well, there is another Pond competition that has the same exact training data.

I downloaded it, changed the read_csv() to these files, ran the model with predict instead of predict_proba, and sent the submission (I had sent a dummy one earlier to check the shape). The result was promising.

Since the evaluation windows were taking their time, I started working on a different approach and reframed the problem as “Event Sequence Classification”. That means building a model that takes a sequence of events (sent tx, swapped, sent another tx, …) and categorizes it as sybil or not. This is something companies do a lot, for example, when predicting a trial conversion or whether a user will churn. Unfortunately, I wasn’t able to finish this approach, as I got stuck figuring out the metadata and how to make a model understand it.

Postprocessing

The postprocessing step was simple yet effective. Since I have a list of labeled contracts from Flipside, I mark them as non-sybil directly!

The last submission I sent is also an ensemble of the previous 20 ones (where I tried all sorts of feature combinations and models)!

Conclusion

Excited to know the final results as the final dataset is quite different from the initial one. I’m sure there are a bunch of things I missed as I didn’t explore this second training dataset as much as I would have liked.

Another side-learning is that these kinds of competitions will train models that overfit to the sybil types seen in the training data. If a sybil wallet is in the training data but not marked as sybil, the model will learn to classify it as non-sybil and will be trained to ignore it. Here, more than in other types of competitions, is where we need a plurality of models, each trained with different approaches and datasets.


Data Collection and Preprocessing

  • Data from both chains (base and ethereum) were concatenated to create unified training and test datasets.

  • Merging operations were conducted using:

    • Address (ADDRESS) ↔ FROM_ADDRESS from transactions
    • Address ↔ ORIGIN_FROM_ADDRESS from token transfers

Cleaning

  • Duplicates were removed using drop_duplicates(subset='ADDRESS') to ensure address uniqueness.
  • Only the first occurrence of each address in merged datasets was retained.

Feature Engineering

A custom function build_merged_df was used to engineer rich temporal and transaction-based features from DEX swaps. Features included:

  • Number of unique active days
  • Time since last transaction
  • Transaction period span in days
  • Total USD value of tokens swapped in/out
  • Average time between swaps
  • Count of transactions per address

Additional features from transaction and token transfer data were included:

  • Gas fees, values, function signatures, raw token metrics
  • Categorical fields like SYMBOL, PLATFORM, TOKEN_OUT
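
A rough sketch of how the temporal swap features above can be computed with pandas (column names such as ORIGIN_FROM_ADDRESS and BLOCK_TIMESTAMP follow the dataset naming used elsewhere in this thread, but the exact build_merged_df logic is not reproduced here):

import pandas as pd

swaps = swaps.sort_values("BLOCK_TIMESTAMP")
g = swaps.groupby("ORIGIN_FROM_ADDRESS")["BLOCK_TIMESTAMP"]
reference_time = swaps["BLOCK_TIMESTAMP"].max()

temporal = pd.DataFrame({
    "active_days": g.apply(lambda s: s.dt.normalize().nunique()),
    "days_since_last_swap": (reference_time - g.max()).dt.days,
    "span_days": (g.max() - g.min()).dt.days,
    "avg_hours_between_swaps": g.apply(lambda s: s.diff().dt.total_seconds().mean() / 3600),
})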

Modeling Strategy

A Stratified K-Fold Cross-Validation (5 folds) strategy was adopted to ensure class distribution balance across splits.

Models Used

Four machine learning models were trained and evaluated using ROC AUC score as the performance metric:

  1. LightGBM
  2. XGBoost
  3. CatBoost
  4. HistGradientBoostingClassifier

Test Data Preprocessing

Test data underwent the same transformation and feature engineering pipeline as the training data.

Prediction

  • Predictions were obtained from each model.
  • A simple average ensemble of the 4 model probabilities was used to generate the final prediction.

ensemble_preds = (xgb_preds + lgb_preds + cat_preds + hgb_preds) / 4

Sybil Detection Model Writeup: Crypto Pond Competition

Participant: Stakeridoo
Email : stakeridoo@gmail.com
Telegram: @ben2k_stakeridoo
X: http://x.com/stakeridoo

Introduction

Imagine a protocol airdrop worth millions — and a single actor claiming thousands of shares using fake identities. Sybil attacks aren’t just theoretical risks; they’re recurring exploit vectors that undermine decentralization at its core. Detecting them is not simply a machine learning problem, but a matter of understanding coordination, incentives, and deception on-chain. This project reflects years of real-world analysis distilled into a practical pipeline that blends custom SQL heuristics, feature-rich LightGBM modeling, and a careful balance between precision and recall.


Data Sources & Preprocessing


Full SQL Dune Dashboard available here

The project leveraged both the Base and Ethereum datasets provided by the organizers, and three tailored SQL-derived sources:

  • Funding Analysis: Focused on initial ETH inflows to establish early clustering and synchronized deployment patterns.
  • Wallet Lifecycle Profiling: Captured first/last transaction times and frequency patterns, chain-specific.
  • Peer Transfers Only: Constructed exclusively from transfers between known addresses, deliberately excluding unfiltered inflow/outflow events, particularly CEX deposit sources.

These SQL-enhanced datasets were designed to reduce false positives by eliminating noise from non-relevant financial activity, such as CEX inflows, and to sharpen contrast between normal users and orchestrated Sybil structures. For example, the exclusion of non-peer transfers prevented misleading clusters that would have otherwise diluted detection accuracy.

Sybils make up a relatively small share of the training dataset:

  • Ethereum train set: 52,501 addresses total, of which ~8,827 are labeled as Sybil (~16.8%)
  • Base train set: 51,515 addresses total, of which ~9,936 are labeled as Sybil (~19.3%)

This imbalance makes precision and effective false positive suppression all the more critical.
I identified mislabeling and omissions in the training set, particularly false negatives. My analysis flagged several wallet groups with clustered activity, suspicious funding synchronicity, and shared behavioral patterns that were previously unmarked. This includes influential edge cases where legitimate users might be misclassified as Sybil due to atypical activity.

Validation Case: The test set includes the wallet 0xf4b0556b9b6f53e00a1fdd2b0478ce841991d8fa , also known as Olimpio, a widely known airdrop influencer. While highly active and cross-chain present, Olimpio is not a Sybil — yet provides a perfect benchmark to test for false positives in high-activity profiles. The model correctly refrains from mislabeling it, validating its robustness. Such edge cases serve as a qualitative validation layer and help benchmark the model’s caution versus aggressiveness.


Methodological Philosophy

Sybil detection is a precision craft. Off-the-shelf classification techniques often conflate noise with signal, especially when features are shallow and uncurated. My philosophy leans on three core principles:

  • Prune noise at the source: Using SQL logic to remove irrelevant CEX flows and reduce cluster confusion.
  • Feature quality > quantity: Focused, behaviorally-motivated attributes rooted in years of observing Sybil farming, airdrop abuse, and DAO manipulation.
  • Validate with edge cases: Ensure real users like Olimpio or campaign overperformers are not falsely flagged.

The internals of my feature design are intentionally undisclosed. What I will share is that they emerged not from this competition but from a longer journey across DAO governance, bridge farming, identity farming, and network graph studies. This “secret sauce” is built on real airdrop abuse detection, not theory. You might think of it as a forensic lens — tuned to subtle patterns invisible to raw metrics — leveraging timing asymmetries, funding fingerprints, and chain-specific behavioral nuances. Certain constructions — involving funding path alignment, coordination delay gaps, and meta-cluster overlap — are subtle, proprietary, and designed to remain stealthy.


Pipeline Overview

  1. Load Data: DEX swaps from Base & Ethereum + SQL-enhanced auxiliary features
  2. False Negative Identification: Heuristic sweep on train using historical clustering
  3. Label Normalization: Consolidation, deduplication, cast
  4. Feature Generation (internally structured into subscores):
     • Activity-based
     • Funding pattern-based
     • Network proximity (via direct ERC20/native transfers)
     • DEX usage patterns (from Parquet)
  5. Graph Metrics: pagerank, betweenness, clustering, two_hop_pathing
  6. Modeling: LightGBM with scale_pos_weight, tuned via CV (see the sketch below)
  7. Calibration: Isotonic regression with CalibratedClassifierCV
  8. Submission Creation: Decision threshold = 0.5
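
A minimal sketch of steps 6-8 (the hyperparameters here are placeholders, not the tuned values):

import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV

# Weight the rare Sybil class by the negative/positive ratio (X, y, X_test assumed).
pos_weight = (y == 0).sum() / (y == 1).sum()
base = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, scale_pos_weight=pos_weight)

# Isotonic calibration on cross-validated folds, then a 0.5 decision threshold.
clf = CalibratedClassifierCV(base, method="isotonic", cv=5)
clf.fit(X, y)
proba = clf.predict_proba(X_test)[:, 1]
preds = (proba >= 0.5).astype(int)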

Feature Importance

Top-ranked features included transaction frequency, funding source diversity, and graph proximity metrics. These were consistently impactful in identifying both subtle and overt Sybil coordination patterns, especially when combined into normalized subscore groups.

My model’s core relies on several composite scores. Each one synthesizes multiple dimensions:

  • Lifecycle depth (age, frequency, recency)
  • Funding velocity (delay after creation, shared funding roots)
  • Graph centrality (proximity to known Sybils or funding centers)
  • Economic behavior (DEX usage under real vs. fake cost profiles)

These features outperform simpler metrics like raw count or USD transferred. Graph augmentation was particularly effective in filtering ambiguous addresses otherwise passed as clean.


Score Distribution

  • Mean Score: ~0.30
  • Median Score: ~0.15
  • ≥0.50 (classified Sybil): 4,776 / 20,369
  • ≥0.95 (high confidence Sybil): 2,895

A strong bimodal distribution reflects the model’s ability to separate clearly between likely Sybils and legitimate users — reducing edge-case ambiguity and streamlining operational decisions.


Evaluation Metrics

Metric                       Training   Validation
AUC                          0.9564     0.9528
Binary Logloss               0.1963
Early Stopping Round         96         96
F1 Score                     0.7181
Number of False Positives    3,086      31
Number of False Negatives    272        210
False Positive Rate          4.00%      0.16%
False Negative Rate          13.15%     40.62%
True Positive Rate           86.85%     59.38%
True Negative Rate           96.00%     99.84%

The relatively high false negative rate in the validation split is attributable to the fragmentation of large Sybil clusters across train/test boundaries. When a cluster is broken up during an 80/20 split, the model may see only partial context during training, resulting in some Sybil addresses being missed. This is amplified by the low overall share of labeled Sybils in the dataset.

Despite this, the model preserves an extremely low false positive rate, underscoring its cautious design philosophy — a critical trait for any system meant for production deployment where trust and integrity matter.


Final Remarks

This submission is the result of a matured process — not just a machine learning pipeline, but an investigative framework built to scale with adversarial creativity. Rather than chasing leaderboard overfitting, I’ve focused on building a system grounded in clarity, robustness, and practical deployment.

I approached this challenge not merely as a data scientist but as a long-time Sybil hunter. Over the past years, I’ve tracked airdrop farming rings, studied bridge farming, and analyzed DAO governance patterns. The patterns I learned — and the SQL queries and Python scripts honed along the way — formed the backbone of this approach.

My tooling isn’t generic, nor is it meant to be. It’s the result of deeply specialized use cases, gradually refined to minimize false positives without losing structural insight into deceptive coordination. From carefully filtered ERC20 funding paths to DEX-based behavioral modeling, each layer builds on a forensic mindset.

The edge is not in the code, but in knowing where to look.

This is why the inclusion of Olimpio in the test set felt so serendipitous. As a high-profile, high-activity participant with atypical transaction behavior, his address served as the ultimate test case for avoiding false positives. That the model passed this benchmark underscores the importance of building with domain awareness — not just statistical rigor.

What you see in the CSV is the output of a trained classifier. But what it represents is more than code — it’s a philosophy of defensive design. My system was built for clarity under pressure, not leaderboard aesthetics. That’s the distinction that enables real-world deployment. The maxim holds true here: you don’t win against Sybils by being fast — you win by being right. This is what separates a good model from a truly deployable one.

pond username: Oleh RCL
model: available in the Pond competition submission, or can be provided as a GitHub repository.

date: 30/05/2025

Sybil Detection: Securing Web3 with Machine Learning and Blockchain Insights

The Challenge of Sybil Attacks

In the decentralized world of Web3, Sybil attacks—where bad actors spin up multiple fake identities—are a real headache. They can skew airdrops, hijack governance votes, or drain funding systems meant to support genuine users. The mission here was to build a model that sniffs out these sneaky Sybil wallets using historical blockchain data. My solution blends advanced feature engineering, graph analysis, and a beefy ensemble of machine learning models to tackle this head-on.

The Data Playground

I worked with a rich mix of datasets to fuel this model. We’ve got labeled data with around 2,500 known Sybil addresses, pulled from heavy hitters like Gitcoin Passport, LayerZero, zkSync, OP, Octant, and Gitcoin’s own ban lists. Then there’s the raw blockchain action: transaction records, token transfers, and DEX swaps from both the Base and Ethereum chains. To spice things up, I folded in a list of potential false negatives from Ben2k—think of it as a cheat sheet to catch Sybils that might’ve slipped through the cracks.

Crafting Features: The Heart of Detection

Good features are the secret sauce of any machine learning model, and I went all out here. I engineered a slew of features to capture the quirks of Sybil behavior, optimized for speed and insight. Here’s the rundown:

1. Temporal Features

  • Time since first transaction: How long has this wallet been around?
  • Mean and variance of transaction gaps: Are transactions steady or all over the place?
  • Transactions per hour: How busy is this address?
  • Burstiness: Spotting sudden flurries of activity—Sybils often can’t help but overdo it.

2. Transaction Counts

  • Total transactions: Raw activity level.
  • Transactions per day: Normalized to see if it’s a slow burner or a hyperactive spammer.

3. Velocity

  • Transaction velocity: Total value moved divided by active days. Fast movers might be up to no good.

4. Token Diversity

  • Unique tokens: Is this wallet a jack-of-all-trades or obsessed with one token? Sybils often farm specific airdrops.

5. Chain Preferences

  • Base transaction ratio: How much action happens on Base versus Ethereum? Sybils might lean one way.

6. Swap Behavior

  • Ethereum-to-Base swap ratio: For DEX users, this tracks cross-chain habits—Sybils might show odd patterns.

7. Entropy

  • Token entropy: How random are the tokens they touch?
  • Hourly entropy: Are transactions timed like clockwork or chaotic? Bots tend to be predictable.
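
For reference, both entropies can be computed from simple value counts (a sketch; SYMBOL and BLOCK_TIMESTAMP are assumed column names for one wallet's transactions):

import pandas as pd
from scipy.stats import entropy

def shannon_entropy(series: pd.Series) -> float:
    """Entropy of the empirical distribution: 0 means perfectly predictable, higher means more random."""
    return float(entropy(series.value_counts(normalize=True), base=2))

token_entropy = shannon_entropy(wallet_tx["SYMBOL"])
hourly_entropy = shannon_entropy(wallet_tx["BLOCK_TIMESTAMP"].dt.hour)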

8. Graph Features

I built a transaction graph with NetworkX and pulled out:

  • In-degree/out-degree: How many wallets send to or receive from this one?
  • Clustering coefficient: Are its buddies tightly knit?
  • PageRank: Is it a big player in the network?
  • Community detection: Using the Louvain method to spot cliques.
  • Sybil connections: What’s the share of transactions to/from known Sybils?

9. Graph Neural Networks (GNNs)

I trained a GraphSAGE model to churn out embeddings from the transaction graph. These capture deep network patterns—like a Sybil’s social circle—that raw stats might miss.

10. Anomaly Scores

An Isolation Forest flagged outliers based on all these features. Weirdos often turn out to be Sybils.

To keep things snappy, I cached these features and ran computations in parallel. No point in reinventing the wheel every time!

Building the Model

Prepping the Data

  • Outlier Cleanup: An Isolation Forest trimmed extreme cases from the training set—less noise, better signal.
  • Scaling: A RobustScaler tamed wild feature values.
  • Balancing Act: Sybils are rare, so I used ADASYN to boost their numbers in training, leveling the playing field.
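
A compact sketch of these three steps (assuming NumPy arrays X, y; the contamination rate is a placeholder):

from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import ADASYN

inliers = IsolationForest(contamination=0.01, random_state=42).fit_predict(X) == 1
X, y = X[inliers], y[inliers]                               # drop extreme outliers from training

X = RobustScaler().fit_transform(X)                         # median/IQR scaling, robust to wild values
X_bal, y_bal = ADASYN(random_state=42).fit_resample(X, y)   # oversample the rare Sybil class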

The Ensemble Dream Team

I went with a stacking classifier—a powerhouse combo of:

  • RandomForestClassifier: Great for spotting patterns.
  • XGBoost, LightGBM, CatBoost: Gradient boosting champs with different flavors.
  • TabNet: A deep learning twist for tabular data.

These base models feed into a final XGBoost layer that ties it all together. I tuned their hyperparameters with Optuna, chasing the best ROC AUC score, and used a stratified split to keep things fair.
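
A sketch of what Optuna tuning for one of the base models might look like (the search space and trial count are illustrative, not the actual settings):

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)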

Calibration

Raw probabilities can be wonky, so I ran CalibratedClassifierCV with isotonic regression to make them trustworthy—crucial for ranking Sybils accurately.

Results That Speak

I tested the model with ROC AUC and precision-recall curves, tweaking an optimal threshold for classification (though the competition wants probabilities). The final predictions landed in a submission file, and I even plotted a histogram of scores to see how they spread out—complete with that sweet optimal threshold line.

Why It Matters

This model isn’t just a competition entry—it’s a step toward a safer Web3. By mixing blockchain smarts with machine learning muscle, it catches Sybils in the act, protecting decentralized systems from manipulation. From graph embeddings to ensemble magic, every piece is designed to outsmart the fakers.

Thanks for reading! This was a blast to build, and I hope it helps keep the Web3 community thriving.

Sybil Detection with Human Passport and Octant Share (Writeup)

The philosophy behind my method is to keep the model simple and take advantage of data source diversity.

Data cleaning

  • Keep only one column from each set of strongly correlated numerical columns.

  • Remove categorical columns with a single unique value, or with as many unique values as there are rows.

Data aggregation

  • Aggregate numerical columns with the following stats: mean, standard deviation, median, min, max, sum

  • Categorical columns: count of unique values

  • Add a row count per aggregation group

Transaction data

Aggregate transaction data by FROM_ADDRESS, TO_ADDRESS separately.

Transfer data

Aggregate transfer data by TO_ADDRESS, FROM_ADDRESS, ORIGIN_FROM_ADDRESS, ORIGIN_TO_ADDRESS separately.

Swap data

Aggregate swap data by ORIGIN_FROM_ADDRESS, ORIGIN_TO_ADDRESS separately.

Model construction

  • Ensemble of models relying on XGBoost and CatBoost, using 5-fold cross validation

  • Construct one ensemble per combination of network (ethereum/base), data type (transaction/transfer/swap), and aggregation column (TO_ADDRESS/FROM_ADDRESS/ORIGIN_FROM_ADDRESS/ORIGIN_TO_ADDRESS, according to data type), following the diversity philosophy.

This led to cross-validation AUC of at least 0.94 across all ensembles, with values commonly above 0.966.

Prediction

  • Get the prediction of each ensemble

  • Take the mean across available ensembles

  • For addresses with no prediction, fall back to the train LABEL mean computed over the Base and Ethereum networks

This led to an average cross-validation AUC of 0.9664520149563772.
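
A sketch of the prediction step (preds, train, and test_addresses are assumed to exist; each per-ensemble prediction is a Series indexed by address):

import pandas as pd

# preds: list of per-ensemble prediction Series, each indexed by ADDRESS.
combined = pd.concat(preds, axis=1).mean(axis=1)

# Addresses not covered by any ensemble fall back to the train LABEL mean.
label_mean = train["LABEL"].mean()
submission = combined.reindex(test_addresses).fillna(label_mean).rename("PRED").reset_index()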

Submission

Use the available test set of 20,369 addresses.

:trophy: Sybil Detection

:chart_increasing: 1. Challenge Overview

  • Objective: Identify Sybil wallets on EVM-compatible chains.

:broom: 2. Data Preparation

  • Combined Ethereum and Base datasets (transactions, token transfers, swaps).

  • Standard cleaning: lowercased addresses, datetime parsing, numeric casting.

  • Introduced a global REFERENCE_TIME (May 31, 2025) to ensure time-consistent feature calculation.

:hammer_and_wrench: 3. Feature Engineering

  • Aggregated features at wallet level across:

    • Transactions: value, gas, counterparties, nonce, timing, activity patterns.

    • Token Transfers: cleaned token symbols (via get_cleaned_symbol), amount, diversity, entropy.

    • DEX Swaps: events, platform diversity, amounts, inter-swap timing.

  • Added temporal features (e.g. durations/recency from first/last timestamps to REFERENCE_TIME).

:wrench: 4. Modeling Approach

  • Models: LightGBM (gbdt) and CatBoost blend

  • Seed Averaging: Used 3 random seeds for each model.
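
Seed averaging can be sketched as follows (shown for the LightGBM part only; hyperparameters are placeholders):

import numpy as np
import lightgbm as lgb

seeds = [0, 1, 2]
preds = np.zeros(len(X_test))
for seed in seeds:
    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=seed)
    model.fit(X, y)
    preds += model.predict_proba(X_test)[:, 1] / len(seeds)  # average over seeds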

Sybil Detection with Human Passport and Octant

Participant: Mahboob Biswas :blush:

Email : mahboobbiswas@gmail.com

Contact : +91 7029232633

Date: May 31, 2025

:white_check_mark: Write-Up for my 11th submission.

End-to-End Write-Up for Sybil Detection Model

This write-up provides a comprehensive overview of the Sybil detection model implemented in the sybil_detection.ipynb Jupyter Notebook. The model aims to predict the probability (0 to 1) of a wallet address being a Sybil (a malicious entity creating multiple identities to manipulate a system) using blockchain data. The notebook employs advanced machine learning techniques, including feature engineering, class imbalance handling, hyperparameter tuning, ensemble methods, and model evaluation, to achieve high performance. Below, we explain each cell, its purpose, expected functionality, and anticipated outputs before building the model, followed by a summary of the overall pipeline.


Overview

Purpose: This cell provides an introduction to the Sybil detection model, outlining its objectives and the improvements implemented to enhance performance.

Content:

  • Objective: Predict the probability that a wallet address is a Sybil using historical blockchain data.
  • Improvements:
    • Address overfitting with cross-validation and regularization.
    • Enhance model performance through hyperparameter tuning, SMOTE for class imbalance, and threshold optimization.
    • Analyze feature importance using model-based metrics and SHAP (SHapley Additive exPlanations).
    • Implement ensemble methods like stacking, LightGBM, CatBoost, and voting classifiers.

Import Libraries

Purpose: Import all necessary Python libraries for data processing, modeling, evaluation, and visualization.

Functionality:

  • Imports libraries for:
    • Data manipulation (numpy, pandas).
    • Machine learning models (xgboost, lightgbm, catboost, sklearn).
    • Visualization (matplotlib, seaborn).
    • Model evaluation (sklearn.metrics).
    • Class imbalance handling (imblearn.over_sampling.SMOTE).
    • Feature importance analysis (shap).
    • Utility (os, joblib, datetime, warnings).
  • Suppresses warnings to keep the output clean.

Data Loading

Purpose: Load the blockchain datasets (transactions, DEX swaps, token transfers, and address labels) for training and testing.

Functionality:

  • Sets a random seed (np.random.seed(42)) for reproducibility.
  • Loads parquet files containing:
    • Test addresses: test_addresses_base, test_addresses_ethereum (contain ADDRESS column).
    • Train addresses: train_addresses_base, train_addresses_ethereum (contain ADDRESS and LABEL columns).
    • Feature datasets: Transactions, DEX swaps, and token transfers for both Base and Ethereum chains.
  • Prints dataset information and column names to inspect the data structure.

Feature Engineering

Purpose: Create features from raw blockchain data to capture patterns indicative of Sybil behavior.

Functionality:

  • Defines a function prepare_data_for_xgboost that:
    • Creates a feature DataFrame indexed by unique addresses.
    • Generates transaction-based features: Count, mean value, mean fee, and standard deviation of gas used.
    • Generates temporal features: Transaction time span (days) and frequency per day.
    • Generates swap-based features: Count, mean amounts in/out, and standard deviation of amount out.
    • Generates interaction features: Ratios of fees to value and swap in/out amounts.
    • Generates transfer-based features: Count and mean amount.
    • Creates categorical features: One-hot encoded counts of swap platforms (e.g., platform_uniswap-v3).
    • Fills missing values with medians and scales numerical features using StandardScaler.
    • For training, joins labels and returns features (X) and labels (y); for testing, returns features only.
  • Combines Base and Ethereum data for training and test sets.
  • Saves the feature order to xgb_feature_order.csv for consistency.

Handle Class Imbalance with SMOTE

Purpose: Address the imbalance between Sybil (minority) and non-Sybil (majority) classes using SMOTE (Synthetic Minority Oversampling Technique).

Functionality:

  • Converts labels to numeric, fills NaN with 0, and binarizes them (threshold 0.5).

  • Applies SMOTE to oversample the minority class (Sybil, label 1) to match the majority class (non-Sybil, label 0).

  • Prints the original and resampled class distributions.

    Original class distribution: {0: 99985, 1: 4031}
    Resampled class distribution: {0: 99985, 1: 99985}

Train-Test Split

Purpose: Split the resampled data into training and validation sets.

Functionality:

  • Splits X_resampled and y_resampled into 80% training (X_train, y_train) and 20% validation (X_val, y_val) sets.

  • Uses stratify=y_resampled to maintain class balance.

  • Prints the shapes of the resulting datasets.

  • Example output:

    Training set shape: (159976, 44), Validation set shape: (39994, 44)
    
  • The data is split, with balanced classes in both sets, ready for model training.


Hyperparameter Tuning for XGBoost

Purpose: Optimize the XGBoost model’s hyperparameters using grid search to maximize ROC-AUC.

Code Cell:

xgb_param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'lambda': [1, 2],
    'alpha': [0, 1]
}

Functionality:

  • Defines a hyperparameter grid for XGBoost, testing various combinations of depth, estimators, learning rate, subsampling, and regularization parameters.
  • Initializes an XGBClassifier with class weights to handle any residual imbalance.
  • Performs 3-fold cross-validated grid search to maximize ROC-AUC.
  • Fits the model on X_train, y_train.
  • Prints the best parameters and ROC-AUC score.
  • Example output:
    Best XGBoost Parameters: {'alpha': 0, 'colsample_bytree': 1.0, 'lambda': 1, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}
    Best XGBoost ROC-AUC: 0.9992139265200622
    
  • best_xgb contains the optimized XGBoost model, ready for ensemble methods.
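
Only the parameter grid is shown above; a minimal sketch of how the search itself might be wired up (the estimator settings are assumptions, only xgb_param_grid comes from the notebook):

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb_clf = XGBClassifier(
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),  # residual class weighting
    eval_metric="logloss",
    random_state=42,
)
grid_search = GridSearchCV(xgb_clf, xgb_param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
best_xgb = grid_search.best_estimator_
print("Best XGBoost Parameters:", grid_search.best_params_)
print("Best XGBoost ROC-AUC:", grid_search.best_score_)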

Train Other Models

Purpose: Train additional models (Random Forest, LightGBM, CatBoost) with basic hyperparameters for ensemble methods.

Code Cell:

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
lgb_model = lgb.LGBMClassifier(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.1,
    is_unbalance=True,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X_train, y_train)
cb_model = cb.CatBoostClassifier(
    iterations=200,
    depth=7,
    learning_rate=0.1,
    auto_class_weights='Balanced',
    verbose=0,
    random_state=42
)
cb_model.fit(X_train, y_train)

Functionality:

  • Trains a Random Forest with 200 trees, max depth 10, and balanced class weights.
  • Trains a LightGBM model with 200 estimators, max depth 7, and imbalance handling.
  • Trains a CatBoost model with 200 iterations, depth 7, and balanced class weights.
  • Fits all models on X_train, y_train.

Ensemble Methods

Purpose: Implement stacking and voting classifiers to combine predictions from XGBoost, Random Forest, LightGBM, and CatBoost.

Code Cell:

stacking_model = StackingClassifier(
    estimators=[
        ('xgb', best_xgb),
        ('rf', rf_model),
        ('lgb', lgb_model),
        ('cb', cb_model)
    ],
    final_estimator=LogisticRegression(),
    cv=3
)
stacking_model.fit(X_train, y_train)
voting_model = VotingClassifier(
    estimators=[
        ('xgb', best_xgb),
        ('rf', rf_model),
        ('lgb', lgb_model),
        ('cb', cb_model)
    ],
    voting='soft'
)
voting_model.fit(X_train, y_train)

Functionality:

  • Stacking Classifier:
    • Combines predictions from XGBoost, Random Forest, LightGBM, and CatBoost using 3-fold cross-validation.
    • Uses Logistic Regression as the final estimator to make the final prediction.
  • Voting Classifier:
    • Combines predictions using soft voting (averaging probabilities).
  • Fits both models on X_train, y_train.

Model Evaluation

Purpose: Evaluate the stacking model on the validation set, optimize the classification threshold, and report performance metrics.

Functionality:

  • Computes predicted probabilities for the validation set using the stacking model.
  • Calculates precision, recall, and F1-scores for various thresholds.
  • Selects the threshold that maximizes the F1-score.
  • Generates binary predictions using the best threshold.
  • Prints the classification report (precision, recall, F1-score per class) and ROC-AUC score.

Output:

  • Best Threshold: A value, e.g., 0.5000, optimized for F1-score.
  • Classification Report: Precision, recall, and F1-score per class (shown in the example output below).
  • Validation ROC-AUC: A high score, e.g., 0.9994, indicating excellent discriminative ability.
  • Example output:
    Best threshold: 0.5000
                  precision    recall  f1-score   support
             0       0.99      0.99      0.99     19997
             1       0.99      0.99      0.99     19997
        accuracy                           0.99     39994
       macro avg       0.99      0.99      0.99     39994
    weighted avg       0.99      0.99      0.99     39994
    Validation ROC-AUC: 0.9994
    
  • The model’s performance is evaluated, and the optimal threshold is identified.
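
The F1-maximizing threshold selection described above can be sketched like this (y_val and val_proba are assumed to be the validation labels and the stacking model's probabilities):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-9)
best_threshold = thresholds[np.argmax(f1[:-1])]  # the last precision/recall pair has no threshold
val_pred = (val_proba >= best_threshold).astype(int)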

Feature Importance Analysis

Purpose: Analyze feature importance using XGBoost’s built-in importance scores and SHAP values.

Functionality:

  • Creates a DataFrame of feature importances from the XGBoost model.

  • Plots a bar chart of the top 20 features using Seaborn.

  • Prints the top 20 features with their importance scores.

  • Uses SHAP to compute and visualize feature contributions for the validation set.

  • Output: A DataFrame listing the top 20 features (only the first few are shown here), e.g.:

    Top 20 Features:
                feature            importance
    0  tx_time_span_days             0.1500
    1  platform_hashflow-v3          0.1200
    2  tx_count                      0.1000
    3  tx_gas_used_std               0.068686
    4  transfer_count                0.032816
    5  tx_freq_per_day               0.019382
    6  tx_gas_used_std               0.068686
    7  swap_amount_out_usd_mean      0.011656
    8  transfer_amount_usd_mean      0.029811
    9  tx_value_usd_mean             0.032430
    10 tx_fee_mean                   0.054944
    ... 
    
  • SHAP Plot: A bar plot showing SHAP values for the top 20 features, highlighting their impact on predictions.

  • The analysis identifies key features driving Sybil detection, such as temporal and platform-specific behaviors.


Cross-Validation

Purpose: Perform 5-fold cross-validation to assess the stacking model’s robustness.
Code Cell:

Functionality:

  • Performs 5-fold stratified cross-validation on the resampled data.

  • Trains the stacking model on each fold and computes ROC-AUC on the validation fold.

  • Prints the ROC-AUC scores for each fold, along with the mean and standard deviation.

  • Output:

    Cross-Validation ROC-AUC Scores: [0.9989968553291696, 0.9990570846466095, 0.9983119511233182, 0.9990006602205515, 0.9990545813956908]
    Mean CV ROC-AUC: 0.9989, Std: 0.0003
    
  • The cross-validation confirms the model’s robustness across different data splits.


Generate Test Predictions

Purpose: Generate predictions for the test set and create a submission file.

Functionality:

  • Defines a function prepare_test_submission that:

    • Combines test addresses from Base and Ethereum datasets, ensuring uniqueness.
    • Aligns test features with the training feature order (xgb_feature_order.csv).
    • Removes duplicate indices and reindexes to match test addresses.
    • Fills missing values with medians.
    • Predicts probabilities and binary classes using the stacking model and optimized threshold.
    • Creates a submission DataFrame with ADDRESS and PRED (probabilities).
    • Saves the submission to submission_improved.csv.
  • Calls the function and prints the first five rows of the submission.

  • A file submission_improved.csv is created with 20,369 unique rows, containing addresses and their predicted Sybil probabilities.


Conclusion

Purpose: Summarize the model’s performance and suggest future enhancements.

Content:

  • Reports a validation ROC-AUC of 0.9994 (XGBoost) and F1-score of 0.9940 (stacking model).
  • Highlights key factors: SMOTE, robust feature engineering (temporal and platform features), ensemble methods, and overfitting prevention.
  • Notes important features: tx_time_span_days, platform_hashflow-v3.

Next write up :next_track_button:

:white_check_mark: Write-Up for my 12th submission.

The sybil-v2.ipynb.zip Jupyter Notebook implements an enhanced machine learning pipeline for Sybil detection in a blockchain context, as part of a competition by Human Passport by Holonym, Octant, and the Ethereum Foundation. The goal is to predict the probability (0 to 1) of a wallet address being a Sybil (fake identity) using historical blockchain data. This write-up provides a comprehensive overview of the model, explaining each cell, outputs, techniques used, and performance metrics, aligning with the outcomes before building the model.


Overview of the Notebook and Objectives

The notebook builds on my previous Sybil detection model by incorporating graph-based features and dynamic thresholding to improve detection capabilities. Sybil detection in blockchain involves identifying fake wallet addresses used to manipulate systems (e.g., airdrops, voting). The model processes blockchain transaction data, extracts features, handles class imbalance, trains multiple models, and uses an ensemble approach to generate predictions.

Objectives

  • Predict the probability (0 to 1) of a wallet being a Sybil.
  • Enhance feature engineering with graph-based features to capture wallet relationships.
  • Implement dynamic thresholding to optimize for multiple metrics (F1-Score, recall, precision-recall AUC).
  • Achieve high ROC-AUC, F1-Score, and recall while minimizing overfitting.

Explanation and Outputs

Import Libraries

  • Purpose: Imports necessary libraries for data processing, modeling, evaluation, visualization, and graph analysis.
  • Key Libraries:
    • Data Processing: pandas, numpy.
    • Models: xgboost, lightgbm, catboost, sklearn (Random Forest, Logistic Regression).
    • Evaluation: sklearn.metrics (ROC-AUC, F1-Score, etc.).
    • Graph Analysis: networkx.
    • Visualization: matplotlib, seaborn, shap.
    • Imbalance Handling: imblearn (SMOTE).

Data Loading

  • Purpose: Loads blockchain datasets and preprocesses the LABEL column in training data to ensure binary labels (0 or 1).
  • Datasets:
    • test_addresses_base, test_addresses_ethereum: Test wallet addresses (ADDRESS).
    • train_addresses_base, train_addresses_ethereum: Training wallet addresses (ADDRESS, LABEL).
    • dex_swaps_base, dex_swaps_ethereum: Swap transactions.
    • token_transfers_base, token_transfers_ethereum: Token transfers.
    • transactions_base, transactions_ethereum: Transaction data.
  • Preprocessing:
    • Converts LABEL to numeric, coercing invalid entries to NaN.
    • Binarizes LABEL using a 0.5 threshold (≥ 0.5 → 1, else 0).
      • 20,369 test addresses, as expected for a competition dataset.

      • 51,515 training addresses, with LABEL initially as object (preprocessed to int).

    • DEX Swaps Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (25 columns).
    • Token Transfers Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (18 columns).
    • Transactions Base Columns: ['BLOCK_NUMBER', 'BLOCK_TIMESTAMP', 'TX_HASH', ...] (30 columns).

Feature Engineering Overview

  • Purpose: Extracts features from blockchain data, including transaction-based, temporal, swap-based, transfer-based, categorical, and graph-based features.
  • Function: prepare_data_for_xgboost(data, is_train=True)
    • Inputs:
      • data: Dictionary with transactions, dex_swaps, token_transfers, train_addresses/test_addresses.
      • is_train: Boolean indicating training (returns features and labels) or testing (features only).
    • Steps:
      • Transaction Features: tx_count, tx_value_usd_mean, tx_fee_mean, tx_gas_used_std.
      • Temporal Features: tx_time_span_days, tx_freq_per_day.
      • Swap Features: swap_count, swap_amount_in_usd_mean, swap_amount_out_usd_mean, swap_amount_out_usd_std.
      • Interaction Features: tx_fee_to_value_ratio, swap_in_out_ratio.
      • Transfer Features: transfer_count, transfer_amount_usd_mean.
      • Categorical Features: One-hot encodes PLATFORM (e.g., platform_hashflow-v3).
      • Graph-Based Features (Optimized):
        • Samples 100,000 rows from transactions and transfers to reduce graph size.
        • Computes in_degree and out_degree directly from edge lists.
        • Builds a sampled graph and subgraph (top 10,000 nodes by degree).
        • Computes clustering_coefficient and betweenness_centrality on the subgraph.
      • Preprocessing: Fills missing values with medians, scales features with StandardScaler.
    • Output:
      • For training: Features X (DataFrame), labels y (Series).
      • For testing: Features X only.
  • Execution:
    • Processes training and test data, saving feature order to xgb_feature_order.csv.
    • Prints: Warning: Duplicate addresses found in addresses DataFrame. Dropping duplicates.
      • Indicates duplicates in train_addresses (expected due to concatenation), resolved by keeping the first occurrence.
  • Output:
    • Warning about duplicates, confirming data cleaning.
    • Features: ~44-48 columns (original + 4 graph-based: in_degree, out_degree, clustering_coefficient, betweenness_centrality).

Handle Class Imbalance with SMOTE

  • Purpose: Balances the dataset by oversampling the minority class (Sybil) using SMOTE.
  • Steps:
    • Ensures y is binary (0/1) by converting to numeric and binarizing.
    • Applies SMOTE to balance classes.
  • Output:
    • Original class distribution: {0: 96524, 1: 2543}
      Resampled class distribution: {0: 96524, 1: 96524}
      
    • Original: Highly imbalanced (97% non-Sybil, 3% Sybil), as expected for Sybil detection.
    • Resampled: Balanced, confirming SMOTE’s effectiveness.

Train-Test Split

  • Purpose: Splits the resampled data into training and validation sets.
  • Steps:
    • Uses train_test_split with 80:20 ratio, stratified on y_resampled.
  • Output:
    • Training set shape: (154438, 48), Validation set shape: (38610, 48)
      
    • Total samples: 154,438 + 38,610 = 193,048 (matches 96,524 * 2 from SMOTE).
    • Features: 48 columns, confirming addition of graph-based features.

Hyperparameter Tuning for XGBoost

  • Purpose: Tunes XGBoost hyperparameters using GridSearchCV.
  • Steps:
    • Defines a parameter grid (e.g., max_depth, n_estimators, learning_rate).
    • Uses 3-fold cross-validation with ROC-AUC scoring.
  • Output:
    • Best XGBoost Parameters: {'alpha': 0, 'colsample_bytree': 0.8, 'lambda': 1, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'subsample': 1.0}
      Best XGBoost ROC-AUC: 0.9992139265200622
      
    • Parameters: Balanced settings (max_depth=7, learning_rate=0.1), indicating a deep but regularized model.
    • ROC-AUC: 0.9992, very high, as expected due to strong features and balanced data.

Train Other Models

  • Purpose: Trains Random Forest, LightGBM, and CatBoost models with basic tuning.
  • Steps:
    • Random Forest: n_estimators=200, max_depth=10, class_weight='balanced'.
    • LightGBM: n_estimators=200, max_depth=7, is_unbalance=True, verbose=-1.
    • CatBoost: iterations=200, depth=7, auto_class_weights='Balanced'.

Ensemble Methods

  • Purpose: Implements Stacking and Voting ensembles using the trained models.
  • Steps:
    • Stacking: Combines XGBoost, Random Forest, LightGBM, and CatBoost with a Logistic Regression meta-model.
    • Voting: Uses soft voting across the same models.

Evaluate Models and Check for Overfitting

  • Purpose: Evaluates all models (XGBoost, Random Forest, LightGBM, CatBoost, Stacking, Voting) on training and validation sets.
  • Function: evaluate_model(model, X_train, y_train, X_val, y_val, model_name)
    • Computes ROC-AUC, F1-Score, recall, precision, and overfitting gap.
    • Plots confusion matrices.
  • Output:
    • XGBoost:
      Training ROC-AUC: 0.9997
      Validation ROC-AUC: 0.9993
      Validation F1-Score: 0.9909
      Validation Recall: 0.9950
      Validation Precision: 0.9869
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0004
      
    • Random Forest:
      Training ROC-AUC: 0.9948
      Validation ROC-AUC: 0.9941
      Validation F1-Score: 0.9700
      Validation Recall: 0.9800
      Validation Precision: 0.9603
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0007
      
    • LightGBM:
      Training ROC-AUC: 0.9996
      Validation ROC-AUC: 0.9991
      Validation F1-Score: 0.9901
      Validation Recall: 0.9949
      Validation Precision: 0.9854
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0005
      
    • CatBoost:
      Training ROC-AUC: 0.9987
      Validation ROC-AUC: 0.9980
      Validation F1-Score: 0.9874
      Validation Recall: 0.9933
      Validation Precision: 0.9815
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0007
      
    • Stacking:
      Training ROC-AUC: 0.9997
      Validation ROC-AUC: 0.9986
      Validation F1-Score: 0.9932
      Validation Recall: 0.9961
      Validation Precision: 0.9903
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0011
      
    • Voting:
      Training ROC-AUC: 0.9994
      Validation ROC-AUC: 0.9988
      Validation F1-Score: 0.9886
      Validation Recall: 0.9938
      Validation Precision: 0.9835
      Overfitting Indicator (Train-Val ROC-AUC Gap): 0.0006
      
    • Confusion matrices plotted for each model.

Dynamic Threshold Optimization

  • Purpose: Optimizes the classification threshold for multiple metrics (F1-Score, recall, precision-recall AUC).
  • Function: optimize_threshold_dynamic(y_true, y_pred_proba, metrics=['f1', 'recall', 'pr_auc'])
    • Optimizes for:
      • f1: Maximizes F1-Score.
      • recall: Maximizes recall with precision ≥ 0.95.
      • pr_auc: Computes Precision-Recall AUC, uses F1-optimal threshold.
    • Plots F1-Score vs. threshold, Recall vs. threshold, and PR curve.
  • Output:
    • Threshold Optimization Results:
      F1: Best Threshold = 0.7751, Score = 0.9939
      RECALL: Best Threshold = 0.7751, Score = 0.9961
      PR_AUC: Best Threshold = 0.7751, Score = 0.9986
      
    • Plots: F1-Score, Recall, and PR curves, showing threshold selection.
    • Best threshold (F1): 0.7751, F1-Score 0.9939, close to original (0.9940).

Feature Importance Analysis

  • Purpose: Analyzes feature importance using XGBoost’s feature_importances_ and SHAP.
  • Output:
    • Top 20 Features:
                        feature  importance
      4          tx_time_span_days    0.643849
      3            tx_gas_used_std    0.068686
      2                tx_fee_mean    0.054944
      12            transfer_count    0.032816
      1          tx_value_usd_mean    0.032430
      13  transfer_amount_usd_mean    0.029811
      5            tx_freq_per_day    0.019382
      27      platform_hashflow-v3    0.017512
      0                   tx_count    0.016956
      6                 swap_count    0.013493
      17         platform_balancer    0.012196
      8   swap_amount_out_usd_mean    0.011656
      7    swap_amount_in_usd_mean    0.008792
      9    swap_amount_out_usd_std    0.007483
      26         platform_hashflow    0.007422
      43            platform_woofi    0.007217
      11         swap_in_out_ratio    0.007030
      22          platform_dexalot    0.004264
      20            platform_curve    0.004062
      31      platform_maverick-v2    0.000000
      
    • SHAP plot: Not shown in output but generated for visual analysis.

Cross-Validation

  • Purpose: Performs 5-fold cross-validation to ensure robust performance.
  • Steps:
    • Uses StratifiedKFold to maintain class balance.
    • Trains the Stacking model on each fold, computes ROC-AUC.
  • Output:
    • Cross-Validation ROC-AUC Scores: [0.999516648229417, 0.999653417270615, 0.9992871301288809, 0.999347305662059, 0.9993718291117792]
      Mean CV ROC-AUC: 0.9994, Std: 0.0001
      
    • Mean ROC-AUC 0.9994, Std 0.0001, indicating high performance and consistency.

Generate Test Predictions

  • Purpose: Generates predictions for test addresses using the Stacking model and optimized threshold.
  • Function: prepare_test_submission(model, test_features, test_addresses_list, threshold, output_file)
    • Aligns test features, predicts probabilities, applies threshold, saves submission.
    • Submission saved to ‘submission_improved_2.csv’
    • 20,369 unique test addresses

Techniques and Methods for High Accuracy

Data Preprocessing

  • Label Binarization: Ensured LABEL is binary (0/1), handling continuous or invalid values.
  • SMOTE: Balanced classes (from 96,524:2,543 to 96,524:96,524), mitigating bias toward non-Sybil.
  • Feature Scaling: Used StandardScaler to normalize features, improving model convergence.

Feature Engineering

  • Temporal Features: tx_time_span_days (0.6438 importance) captures Sybil behavior (short activity spans).
  • Graph-Based Features: Added in_degree, out_degree, clustering_coefficient, betweenness_centrality to identify network patterns, though their importance was lower than expected due to sampling.
  • Platform Features: platform_hashflow-v3 (0.0175) highlights platform-specific Sybil behavior.

Model Training

  • XGBoost: Achieved ROC-AUC 0.9993 with tuned parameters (max_depth=7, learning_rate=0.1).
  • Ensemble (Stacking): Combined models, achieving F1-Score 0.9932 and recall 0.9961, leveraging diverse strengths.

Dynamic Thresholding

  • Optimized for F1-Score (0.9939), recall (0.9961), and PR-AUC (0.9986), ensuring flexibility in metric prioritization.

Overfitting Handling

  • Small Gaps: Overfitting gap < 0.001 for most models (e.g., XGBoost 0.0004), indicating minimal overfitting.
  • Cross-Validation: Std 0.0001 confirms consistency.
  • Regularization: Used in XGBoost (lambda=1), Random Forest (max_depth=10), and other models.

Feature Importance

  • Top Features:
    • tx_time_span_days (0.6438): Dominant, as expected.
    • tx_gas_used_std (0.0687), tx_fee_mean (0.0549): Indicate irregular transaction patterns.
    • Graph Features: Not in top 20, possibly due to sampling; may improve with larger subgraphs.

Conclusion

The model achieves exceptional performance (ROC-AUC 0.9994, F1-Score 0.9939, recall 0.9961), meeting expectations for Sybil detection. Graph-based features and dynamic thresholding enhance detection, though further optimization (e.g., larger graph samples) could improve graph feature importance. The model is robust and consistent.

My Sybil Detection Model: A Hybrid Approach

Code :GitHub - AswinWebDev/Sybil-Detection_2

So, here’s the deal with my model: I’ve built it to catch Sybil wallets by using a mix-and-match (hybrid ensemble) strategy. The main idea is to get really good features from on-chain data and then use a few different modeling angles to make the final call.

How My Model Works at Its Core

My solution uses an ensemble of multiple specialized models, each contributing to the final prediction:

  1. The Ethereum Expert (25% weight): A LightGBM model trained exclusively on Ethereum chain data, focusing on transaction patterns, token transfers, and DEX activities specific to Ethereum.

  2. The Base Chain Specialist (10% weight): A LightGBM model trained on Base chain data, identifying Sybil patterns unique to this ecosystem.

  3. The Combined Model (40% weight): A LightGBM model that processes features from both Ethereum and Base chains, capturing cross-chain behavioral patterns that might indicate coordinated Sybil activity.

  4. Etherscan Features Model (15% weight): When enabled, this component incorporates additional on-chain data from Etherscan’s API, providing supplementary signals about address behaviors.

  5. Pattern Detection (0% effective weight): The code includes infrastructure for pattern-based detection with a defined 10% weight, but this component is not currently active in the final ensemble.

I combine the predictions from these components using their respective weights, then apply a post-processing step (enhance_predictions.py) to refine the probability scores. This final adjustment helps optimize the prediction distribution for better AUC performance without changing the model’s fundamental decisions.
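A rough sketch of that weighted blend, with hypothetical component names; the renormalization over active components is my assumption, since the listed weights sum to 90% with pattern detection switched off:

```python
import numpy as np

WEIGHTS = {
    "eth_lgbm": 0.25,        # Ethereum expert
    "base_lgbm": 0.10,       # Base chain specialist
    "combined_lgbm": 0.40,   # cross-chain model
    "etherscan_lgbm": 0.15,  # Etherscan features model (when enabled)
    "pattern": 0.00,         # defined but currently inactive
}

def blend(predictions: dict) -> np.ndarray:
    """Weighted average of per-component probabilities over the active components."""
    active = {k: w for k, w in WEIGHTS.items() if w > 0 and k in predictions}
    total = sum(active.values())
    return sum(w * predictions[k] for k, w in active.items()) / total
```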

The Data I Used and How I Kept It Clean

I pulled in data from a couple of places:

  • This Competition’s Data: The main dataset provided for the current Sybil detection challenge, covering both Ethereum and Base chains.
  • External Data from CryptoPond ModelFactory #2: I also incorporated an external dataset from a previous Sybil detection competition hosted on CryptoPond (https://cryptopond.xyz/modelfactory/detail/2). This gave me more examples of Sybil and non-Sybil behaviors.

Data Leakage - A Big Watchout:
I was pretty careful about data leakage. My load_addresses function in the main script has checks for a couple of things:

  1. It makes sure no addresses from the competition’s test set are accidentally sitting in my training set.
  2. It also checks if any addresses from the external training data (from CryptoPond) are present in this competition’s test set. This is super important to prevent unrealistic performance boosts.
    If any such leaky addresses are found, my script logs how many there are (e.g., “Data leakage detected! X addresses found…” or “CRITICAL LEAKAGE: Y addresses from external training data are in the competition test set!”) and then automatically removes them from the training data I use. This way, the model is trained and tested fairly.
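A minimal sketch of the kind of check a load_addresses function like this performs (a reconstruction, not the original code; the address column name is an assumption):

```python
import pandas as pd

def remove_leaked_addresses(train_df: pd.DataFrame, test_addresses, address_col="ADDRESS"):
    """Drop training rows whose address also appears in the competition test set."""
    test_set = {a.lower() for a in test_addresses}
    leaked = train_df[address_col].str.lower().isin(test_set)
    if leaked.any():
        print(f"Data leakage detected! {leaked.sum()} addresses found in both sets; removing them.")
    return train_df[~leaked].copy()
```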

Cool Features I Engineered

I didn’t just use raw data; I cooked up a bunch of features like:

  • Transaction Basics: How much, how often, transaction values, gas fees, and if they’re talking to smart contracts.
  • Token Moves: How ERC-20 tokens are being passed around, the variety of tokens an address holds, and interactions with specific well-known tokens.
  • DEX Activity: How often addresses swap on DEXs, which tokens they swap, and the amounts.
  • Cross-Chain Behavior: Features comparing activity across chains (are they doing the same thing on ETH and Base?).

Some Hurdles I Ran Into (And What I Learned)

I tried a few advanced things, and here’s how that went:

  1. Etherscan API for More Clues:
    Etherscan has awesome extra info (like address tags or whether a contract is verified) that could really help. But fetching this data for every single training address takes forever: the API calls are slow, you get rate-limited, and it just wasn’t practical for the sheer number of addresses. I have the code to do it (USE_ETHERSCAN_FEATURES flag and all), and I even tried fetching data in parallel across multiple API keys, but it still took too long. I wrote scripts to fetch the data for the training addresses and integrated them into the model, but due to API limits and the time constraint, the infrastructure exists in the codebase without being active in the main pipeline, which keeps training times reasonable. For a competition, it’s tough to rely on this for all addresses unless you have a ton of time or a clever way to gather the data beforehand.

  2. Graph Neural Networks (GNNs):
    Just getting the data ready, making the subgraphs, and then training them takes a massive amount of time and computing power. I couldn’t quite get it fully working within my project’s timeframe.

What I Might Try Next

If I had more time or resources, I’d look at Etherscan and GNNs again, maybe with a more efficient data setup or by using them on smaller, specific groups of transactions.

So, that’s my model – trying to be smart and practical, with a few ideas for making it even better!


Sybil Detection Challenge: Our Modeling Approach

After a little bit of prodding from some friends of ours, we took on the Sybil detection challenge hosted by Human Passport, Octant, and the Ethereum Foundation. Our approach focuses on comprehensive feature extraction from on-chain activity, robust modeling with LightGBM, and heavy use of graph-based and behavioral insights to differentiate between Sybil and human wallets.


1. Feature Engineering Across Chains

We started by building rich behavioral profiles for each wallet using data from both Ethereum and Base chains. This included transactions, token transfers, and DEX swaps.

We engineered features in the following categories:

  • Basic Activity Metrics: Counts of sent and received transactions and token transfers, as well as the diversity of counterparties interacted with.
  • Monetary Behavior: Aggregated ETH and token values, standard deviations, and fee ratios.
  • Temporal Cadence: Wallet activity timelines such as first/last seen days, active days, inter-transaction timing, and weekday/hour entropy.
  • Gas Usage: Statistical summaries of gas usage and low-gas behavior flags.
  • Swap Behavior: DEX activity including number of swaps, swap pair diversity, and USD volumes.
  • Ping-Pong Patterns: Detection of rapid back-and-forth transaction motifs within small time windows.
  • Graph Embeddings: We built an undirected graph using all transaction, transfer, and swap edges, and used Node2Vec to embed each wallet into a 128-dimensional vector space capturing its structural role in the network.

We applied all of the above feature extractions individually to Ethereum and Base chains. Then we created ratio features comparing Ethereum to Base behavior to detect inconsistencies that are typical in Sybil activity.


2. Graph Embedding with Node2Vec

The embedded transaction graph helped capture wallet position and behavior in the broader network. We used Node2Vec with tuned parameters (walk length, context window, etc.) to generate dense vector embeddings for each address. These embeddings were merged with engineered features to form the final feature matrix.
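A compact sketch of this step using the open-source node2vec package; the walk parameters below are placeholders, since the exact tuned values aren't stated:

```python
import networkx as nx
from node2vec import Node2Vec

def embed_wallets(edges, dimensions=128):
    """Build an undirected wallet graph from (src, dst) pairs and learn embeddings."""
    graph = nx.Graph()
    graph.add_edges_from(edges)
    n2v = Node2Vec(graph, dimensions=dimensions, walk_length=30, num_walks=10, workers=4)
    model = n2v.fit(window=10, min_count=1)        # gensim Word2Vec under the hood
    # model.wv[str(node)] holds the 128-dim vector for each wallet address
    return {node: model.wv[str(node)] for node in graph.nodes()}
```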


3. Data Preparation and Augmentation

We combined labeled addresses from both Ethereum and Base chains, removing duplicates to create our training set. We also integrated the Ben2k list of suspected false negatives by relabeling them as Sybil addresses.

To combat class imbalance (about 1:38 Sybil:human), we applied undersampling to the majority class during training.


4. Model Training and Optimization

We used LightGBM for model training, which offered a good balance between speed, interpretability, and performance. Our pipeline included:

  • Stratified 7-fold cross-validation
  • Optuna-based hyperparameter tuning per fold (40 trials)
  • Fold-wise out-of-fold predictions and AUC reporting
  • Feature importance tracking for gain and split

After evaluating CV performance (OOF AUC), we retrained on the full dataset using the median of best-performing hyperparameters and average best iteration count.
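A condensed sketch of what a per-fold tuning loop of this kind looks like, assuming the native LightGBM training API and a plain Optuna study (the search space and boosting rounds are illustrative, not the exact configuration used):

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import roc_auc_score

def tune_fold(X_tr, y_tr, X_va, y_va, n_trials=40, seed=42):
    """Optuna search maximizing validation AUC for a single CV fold."""
    def objective(trial):
        params = {
            "objective": "binary",
            "metric": "auc",
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
            "seed": seed,
            "verbosity": -1,
        }
        dtrain = lgb.Dataset(X_tr, label=y_tr)
        dvalid = lgb.Dataset(X_va, label=y_va, reference=dtrain)
        booster = lgb.train(params, dtrain, num_boost_round=2000,
                            valid_sets=[dvalid],
                            callbacks=[lgb.early_stopping(100, verbose=False)])
        return roc_auc_score(y_va, booster.predict(X_va))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```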


5. Prediction and Submission

The final model was used to generate probability scores for all test addresses. We ensured full coverage of the test set and saved the results as a compliant CSV for submission.


6. Final Thoughts

This competition challenged us to deeply understand behavioral and structural signals of Sybil activity. Our approach emphasized a multi-layered analysis: raw transactional patterns, temporal dynamics, network structure, and cross-chain inconsistencies. We’re proud of the system we built and look forward to seeing its impact in ongoing Sybil resistance efforts.


Appendix: The Feature Buffet! :curry:

Below is a richer tour of the ten feature families we baked into the model. Think of each family as a camera angle on wallet behavior; together they create a full-body X-ray that exposes the cardboard cut-outs hiding among real humans.

  1. Activity Volume – “How much noise do you make?”
  • (Transactions and Transfers) A Sybil farmer usually spins dozens of low-stake wallets that do something just often enough to scrape eligibility rules. Counting how many times an address pushes (or receives) ETH or tokens lets us spot:

    • Under-active ghosts – nearly empty wallets created only for claim day.
    • Over-active bots – conveyor-belt addresses that fire hundreds of micro-tx per hour.
  • (Unique tokens touched) Legit users’ holdings tend to reflect personal taste, while Sybils favor whatever asset qualifies for a campaign. A wallet that sends and receives 32 different obscure airdrop tokens but never stable-coins? Suspicious.

  2. Monetary Footprint – “Show me the money”
  • (ETH statistics (sum, mean, median, max)) High-balance whales and zero-balance zombies behave very differently; both extremes are easy to flag. We also compute

    • Coefficient of variation – Sybils often repeat the exact value (e.g., 0.005 ETH) to satisfy “non-zero” filters without overspending, yielding almost-zero variance.
  • (Fee ratio (gas fees vs. value sent)) Humans hate overpaying. If a wallet routinely burns fees that dwarf the value being moved, it might be an automated faucet script that doesn’t care about ROI.

  • (USD volume on tokens) Translating token amounts into USD normalises across meme-coins and blue-chips, exposing wallets that shovel large volumes of valueless tokens just to look busy.

  3. Temporal Rhythm – “When do you show up?”
  • (First seen / Last seen / Lifetime) A wallet born yesterday and already hyper-active screams throwaway burner.

  • (Active-days count) Real users visit sporadically over months; Sybils often crunch everything into a single weekend.

  • (Mean & standard-deviation of inter-tx seconds + Burst-iness z-score) Bots fire in tight, consistent clusters. Humans are messy: morning coffee, lunch break, midnight DeFi spree. Extreme regularity is a red flag.

  • (Weekday & hour entropy) Entropy near zero ⇒ all actions happen on one weekday/hour (scripted). High entropy ⇒ spread across the calendar (organic).

  • (Night-owl ratio (00-05 h UTC)) Legit Western users sleep; Sybil scripts happily continue. A sky-high night ratio is incriminating.

  4. Gas-Price Habits – “Do you pinch gwei?”
  • (Mean / median / std gas price + Coefficient of variation) Batch scripts often reuse fixed gas parameters; variance collapses.

  • (Low-gas counter (bottom 10 % by block)) Farmers queue transactions at absurdly low prices to save pennies; patient humans generally pay market rate to avoid stuck txs.

  5. Counter-Party Diversity – “Who are your friends?”
  • Unique addresses contacted (outgoing) and unique sources (incoming)
    • Tight bubble: one wallet whips tokens between the same 3 siblings → Sybil farm.
    • Wide circle: dozens of unrelated peers → natural trading or DeFi use.
  6. Swap Behaviour – “Do you actually trade?”
  • Swap count - Airdrop-only wallets often skip DEXs completely; professional farmers do one obligatory swap of tiny size.
  • Pair diversity - Engaged DeFi users explore many markets (ETH/USDC, wBTC/ETH, etc.). Low diversity means checklist activity.
  • Total USD swapped - If the dollar volume is tiny but fees are paid anyway, motive is probably eligibility, not profit.
  7. Ping-Pong Flag – “Are you bouncing tokens with yourself?”

We scan for back-and-forth transfers between two addresses inside a 10-minute window (see the sketch after this feature list). That self-wash is textbook Sybil: inflate “number of engaged wallets” metrics without spending real money.

  8. Social Graph Embeddings – “Who sits near you in the giant address graph?”
    All tx, transfers, and swaps across both chains are stitched into a single undirected network. Node2Vec then wanders these edges to learn a 128-dim vector for every node—capturing:
  • Closeness to popular hubs (DEX routers, bridges).
  • Membership in dense farmer clusters (many edges but low entropy).
  • Isolation (bridges only to its creator’s main wallet).

These embeddings often reveal structure the naked eye misses.

  9. Cross-Chain Personality – “Are you a chameleon?”

For every Ethereum feature that also exists on Base we compute eth / base ratios. Genuine users often behave similarly across chains (same bedtime, same gas strategy). A ratio that explodes—for example, 100× more swaps on Base than on ETH—may indicate a wallet spun up solely to farm a Base campaign.

  1. Meta-Flags & Hygiene
  • Missing-value masks – sometimes “feature = 0” is informative (e.g., never swaps).
  • Infinity guards & universal fill-0 – keeps LightGBM happy and prevents freak explosions in ratio features.

DATA & PRE-PROCESSINGS

  • merge Base and ETH data together for dex_swaps, token_transfers and transactions
  • use Ben2k’s false negative list to fix the training data
  • get the addresses of all contracts/DEXs etc. from Flipside and remove them from the training data (idea stolen from David, thanks lol)
  • create a new dataset for all addresses

FEATURES

  • temporal features seemed fairly useful
  • gas features too
  • ratios against key metrics also yielded positive results

A FEW NOVEL FEATURES

  • Benford’s law (see the sketch after this list)
    • Take the first digit of each number in a set of measurements
    • Compare the observed first-digit frequencies against the distribution Benford’s law predicts
    • The idea behind Benford’s law is that in natural, non-fraudulent datasets we expect a particular distribution in the frequency of the first digit of a measurement.
    • For instance, if you asked 100 people how much cash they physically had on them, you’re much more likely to get answers like 130, 15, 30, 50, 12, or 6 than 731, 456, or whatever. That’s the law, and that’s the approach.
  • Pattern Recognition
    • This approach led to no discernible model improvements, so I got rid of it. However, I still believe the approach has merit and could be adapted into a high-performing feature in future sybil/fraud detection models.
    • Anyway, the approach is basic. I look for the pattern of where an address appears, in either TO or FROM, then classify it. So if an address appears in chronological order as TO FROM FROM TO TO, I would classify it as 12211. I tried this for patterns of length 3 and 5, adding a column for each pattern and marking it 1 if the address belonged to that pattern.
    • This was done for the first 3/5 instances and the last 3/5 instances.
    • My thought was that sybils may share some commonality in how they interact / are interacted with.
    • This really should be explored further across interactions of all addresses. If two addresses interact with each other and share the same pattern, I would imagine that would be a nice sybil indicator.
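As promised above, a small sketch of a Benford-style feature: the deviation of observed first-digit frequencies (e.g., of an address’s transaction values) from the Benford distribution. This is my own illustrative formulation, not necessarily the exact feature used here:

```python
import numpy as np

# Expected Benford frequencies for leading digits 1..9
BENFORD = np.log10(1 + 1 / np.arange(1, 10))

def benford_deviation(values) -> float:
    """Chi-square-style distance between observed first-digit frequencies and Benford's law."""
    vals = np.asarray([v for v in values if v and v > 0], dtype=float)
    if len(vals) == 0:
        return 0.0
    first_digits = (vals / 10 ** np.floor(np.log10(vals))).astype(int)  # leading digit, 1..9
    observed = np.bincount(first_digits, minlength=10)[1:10] / len(vals)
    return float(np.sum((observed - BENFORD) ** 2 / BENFORD))
```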

TRAINING

CatBoost was used. I found my model heavily overestimated the number of sybils in the test set, even though training AUC scores were alright (~0.95). The results on the leaderboard? Kinda dreadful.

So when training I opted to heavily weight the non-sybil addresses. That may seem counterintuitive, but since my model was already overestimating the number of sybils, it was necessary:

class_weights={0: 10, 1: 1}
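In CatBoost terms, that weighting looks roughly like this (the other parameters are illustrative defaults, not the author's exact settings):

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    class_weights={0: 10, 1: 1},   # down-weight the sybil class, as described above
    eval_metric="AUC",
    random_seed=42,
    verbose=False,
)
# model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
```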

POST-PROCESSING

  • I used some public lists, from Optimism, Flipside, and Hop, to set addresses after training

INSIGHTS & CONCLUSION

This was pretty fun actually, and it would be good to explore these models further in the future. I would love quicker feedback from the leaderboard, but I can understand the constraint. Also, it might be a good idea to build a repository of existing sybil lists; some were hard to come across and had many false positives. A source of truth could be very useful in the future.

:hammer_and_wrench: Approach Overview

My approach combines transaction behavior analysis, address-level statistics, and basic graph-based features to predict Sybil wallets in the Ethereum ecosystem. The goal was to capture subtle signals that distinguish genuine wallets from coordinated Sybil actors.

:magnifying_glass_tilted_left: Features Used

I engineered a combination of the following features:

  • Transaction frequency (daily/weekly activity)
  • Unique interaction count (number of unique addresses interacted with)
  • Diversity of protocols used (smart contracts engaged)
  • Incoming vs outgoing transaction ratio
  • Gas usage patterns
  • Time between first and last transaction
  • Ben2k’s potential false negatives list, used to improve label quality in training

These features were normalized and used to train classification models.

:robot: Model Used

After testing various models (Logistic Regression, XGBoost, and LightGBM), I settled on LightGBM for its balance of speed and performance. It handles sparse features and irregular distributions well.

Hyperparameters were tuned using 5-fold cross-validation.

:test_tube: Evaluation

Performance was tracked using the leaderboard validation set (up to wave 4). Key metrics I monitored:

  • AUC
  • Precision at top-k (see the sketch after this list)
  • Recall on Ben2k’s suspected false negatives (to minimize missed Sybils)
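For clarity, a minimal sketch of the precision-at-top-k metric mentioned above (my own formulation):

```python
import numpy as np

def precision_at_k(y_true: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Fraction of true Sybils among the k highest-scoring addresses."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))
```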

:brain: Insights Gained

  • Ben2k’s false negatives list was helpful in identifying mislabeled data. Retraining with reweighted labels slightly boosted performance.
  • Wallets with very low or very high transaction activity are more predictable; mid-range wallets are harder to classify.
  • Behavioral patterns (such as repeated interactions with a few contracts) were strong indicators of Sybil behavior.

:white_check_mark: Final Submission

I have submitted predictions for the full test set, as required, and this write-up summarizes my methodology.

Thanks to the Octant team for hosting this insightful and challenging competition.

Best regards,
[AbbaatyDAO]

Happy to share that we have completed preliminary reviews on each submission.

Thank you all for the long wait while we completed and digested the judge reviews. Expect the final prize winners to be announced this week, from among the following shortlist:

Rhythm
Polikir
egideons
Uncharted
Mahboob
Alertcat
Limonada
David Gasquez
Stakeridoo
Oleh_RCL
Ash
Omniacs
Jpegy

Below are the raw comments from the judges on each participant.

@Rhythm2211

Judge 1: In my top 14

Writeup is very thorough – I personally appreciate the level of detail, though there are places where editing for conciseness would be helpful. I find the overall data science and modeling work compelling. Pros: the data-driven insights for distinguishing the Sybil population: reasonable metrics, with clear graphics, nice work done in Feature Engineering (Section 4). Potential Changes: A bit dense, could be edited down without sacrificing important information. Would also like to know more about the methodology of feature importances, to ensure it is not succumbing to known issues (see e.g. [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance)

Judge 2: 6/7

Textbook standard. One of the only submissions to share exploratory work and it’s a deep dive from many different angles. Appreciate the emphasis on cross-validation and well-executed ensemble. Could be more concise.

Judge 3: 4/5

Solid work. Detailed EDA that clearly informed feature engineering, and a robust ensemble achieving great scores. A bit too conventional compared to other submissions though

@achankun

Judge 1: Pros: Writeup does a good job balancing detail with brevity. Modeling choices (architecture and parameters) are explained and justified. Key insights that can be built on in the future. Potential Changes: There are a few places where adding specific numbers or visuals would add a lot (e.g. “Top 10 Features by Importance” would benefit from more context like the relative relationship between the importances). Also it would be helpful to know more about the feature importances methodology to ensure it isn’t succumbing to known issues e.g. [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Judge 2: 5/7

A great demonstration of how doing the basics well can get you most of the way there. Solid model choice and feature engineering. Developing some of the features mentioned near the end of this write-up would help improve this submission further.

Judge 3: 3/5

Great implementation of basic aggregates and LightGBM, but very conventional. No attempts to explore deeper Sybil signals.

@ujju513

Judge 1: Pros: New and informative features engineered, clear description of modeling choices including both architecture and parameter values. Ideas explored in “Next Steps”. Potential Changes: Offer more justification for design decisions. For instance, what specific information about features was used in determining whether to use mean or median for imputing values, and what justifies the choice of Label Encoding over some kind of Bayesian-informed smoothing paradigm?

Judge 2: 4/7

Overall a solid approach. Better feature selection would probably improve model performance. Doesn’t do anything novel but does the basics in a standard way.

Judge 3: 3/5

A competent approach but sticks closely to standard tabular methods. Preprocessing and feature engineering are sound, but the submission lacks real insight into Sybil detection strategies.

@OG_KIRILL_F12

Judge 1: In my top 6

Pros: Writeup has a clear and conversational tone. The data cleaning techniques were appropriate – appreciated the emphasis on removing features with missing values or correlation to other features. Novel loss function. Potential Changes: More explanation of the architecture by which final models are combined. Also, the metrics are cool but the ultimate recommended “decision criteria” are unclear to me.

Judge 2: 3/7

This general approach could plug and play with any dataset. It doesn’t show specific insights into sybil behavior. Feature generation is throwing the kitchen sink at us but is somewhat saved by the amount of pruning

Judge 3: 3/5

Great to see the considerations of graph metrics and the use of Focal Loss for class imbalance. However, the write-up lacks details on feature construction and how graph features concretely impacted performance, which would have strengthened the submission.

@bigbrother

judge 1: Pros: Clear focus on false negatives, which is a difficult part of the problem. Explanation of engineering choices made. Incorporated external data set of false negative sybils. Potential Changes: More emphasis on explainability and insight. There is no description of performance. It would be nice to see concrete insights as to what makes something a false negative Sybil account, and what a system needs to do to recognize them.

judge 2: 4/7

Repairing labels is a cool idea but this write up lacks evidence that it improved the results. Random Forest is a classic approach otherwise and this is a solid swing at bat.

judge 3: 3/5

I appreciated the considerations around false positive corrections and including them as a mechanism, but I’d have loved more details on how they were implemented or on the specific patterns detected to guide those corrections, or more quantitative results generally

@egideons

Judge 1: In my top 14

Pros: I personally like the use of graphics to explain and describe different concepts. (The graphics also match my preferred cyberpunk esthetic.) Appreciate the discussion of technical challenges encountered, as it feels both authentic and valuable to the community. I find the architecture and performance optimizations very compelling. Potential Changes: I’m going to challenge any feature_importances methodology that isn’t explained/justified. There are many potential issues with off-the-shelf methods [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Judge 2: 6/7

Thoughtful approach to EDA and feature engineering. The diagrams are appreciated and the writing is clear. Well-executed throughout.

Judge 3: 4/5

Great to see clear documentation of technical challenges, and creative confidence-boosting strategies.

@AMDYES

Judge 1: Pros: The writeup is straightforward to read, and has a nice hook. Use of external data for false negatives. Valuable signals in the features engineered and the summary insights. Potential Changes: Markdown formatting is a bit wonky. Would be nice to see discussion/reflection on actual model performance, and/or other modalities (graphs, numbers, etc) to give more insight to the observations surrounding Sybil characteristics.

Judge 2: 5/7

Easy to read write-up and does the basics well. Appreciate the focus on building network and temporal features likely to be successful in sybil detection. Mentions trying different models but does not state if they selected one or went with an ensemble.

Judge 3: 3/5

Great foundations. But the writeup feels light, with limited detail on modeling validation, performance metrics, or how the final predictions performed. More precise results and structured analysis would elevate confidence in the approach.

@debiao

Judge 1: Pros: Low-complexity model. Highly explainable, offers insight into underlying data – might be especially useful as part of a larger ensemble or pipeline. Appreciate the value-add of personal reflection/experience on “high alpha KOLs”. Potential Changes: Normal distribution doesn’t feel like an intuitive choice to me, since the distribution is clipped at 0 (no one has a negative number of transactions). Also would want to see this compared to another simple method such as a DummyClassifier. What does this method do with the Base chain transactions?

Judge 2: 2/7

A really simple model and an audacious submission that still sees mid-level results. I wouldn’t use it in production but the metrics demonstrate that good feature choice alone can get you quite far.

Judge 3: 1/5

Too simplistic for serious Sybil detection. Not a viable model.

@0xZphr

Judge 1: Pros: Nice high-level summary of design choices re: feature selection and model choice. Potential Changes: More details would be helpful – there are lots of ways to obtain a Random Forest Classifier, so we’re curious how the final model came into being.

Judge 2: 5/7

Vanilla approach with solid features.

Judge 3: 3/5

Solid execution with reasonable features and clear explanations, but remains generic, with no creative modeling or advanced insights to set it apart from others.

@hutchersonkeeland

Judge 1: Pros: High-level summary of modeling choices. Potential Changes: Notebook is referenced, but not included. Would be nice to see more granular description of technique, or more creative approach to working with the data – this feels a bit like an outline.

Judge 2: 4/7

Solid and straightforward approach. Could have focused on finding more sybil-specific insights and features. A bit light on detail.

Judge 3: 3/5

Careful pipeline with sensible feature choices, but entirely classic tabular ML without any exploration of more sophisticated relational or temporal dynamics. The writeup could benefit from more details as well.

(Continued post as can only tag 10 users at a time)

@Candy

Judge 1: In my top 14

Pros: Writeup is detailed and well-organized. It’s always nice to see concrete efforts to optimize code, since these can generalize to wider swaths of the problem space. Potential Changes: Going to call out feature_importances work to make sure it addresses known issues with off-the-shelf techniques, see [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance. The “basic features” section appears to be missing details.

Judge 2: 5/7

Cool to see the multithreading runtime improvements. Fairly standard modelling. More focus on validation would improve the submission.

Judge 3: 3/5

Technically strong but lack of model evaluation metrics prevents a clear assessment of predictive performance, limiting practical confidence.

@uncharted

Judge 1: In my top 14

Pros: Demonstrates solid data science fundamentals. Very detailed writeup. I appreciate the use of an external dataset to help address unbalanced labeling. The approach emphasizes cross-chain data, with justification for design choices. Tone feels authoritative and authentic. Nice insights from the graph data. Cool to see the value of the bucketing approach and discussion of its performance. Potential Changes: I appreciate thorough and detailed reports, but this one might benefit from editing for conciseness and clarity. I’m kind of drowning in details at a certain point. Also going to request more details and/or reflection on the feature importances measurements, to address issues with off-the-shelf techniques [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Judge 2: 6/7

Very well developed features and model. Especially nice to see the graph-based components. I applaud the effort to add more external sybil labels but I wonder if this helped or hurt the model performance given the prevalence of false sybils in many online lists.

Judge 3: 4/5

Great depth and rigor across all phases, and the write-up was pleasant to read. I loved how thoughtful it was in interpreting results and iterating on the approach

@ewohirojuso

Judge 1: Pros: Gives clear and thorough overview of process, particularly the data-cleaning and preprocessing steps. This may be helpful for formatting or pre-cleaning data in the future.
Potential Changes: Generally, I would like to see more novelty or unique insight. This report is similar to others in technique and description. On a more granular, fundamental level – there are concerns about setting all missing values to the constant -1 – this shifts the distribution (for instance, lowering the mean and median) and introduces de facto unrealistic outliers. It would be good to know more about this design choice

Judge 2: 4/7

Solid approach that hits the basics well including cross validation and feature development. Nothing outside of the standard and would have been nice to see some performance metrics.

Judge 3: 3/5

Thoughtful preprocessing and label correction, but a bit too conventional and missing clear performance metrics.

@xingxiang

Judge 1: Pros: Clear overview of process. Appreciate the coherent design philosophy: “Convert complex blockchain data into feature vectors usable by machine learning” and “Automate model training and optimization processes to ensure optimal performance”.

Potential Changes: The actual prediction models (e.g. Random Forest, XGB, etc) being used in the ensemble are never discussed. This is an important piece of missing information. For imputing missing values, there are concerns about setting all missing values to constant -1 – this shifts the distribution (for instance, lowering the mean and median) and introduces de facto unrealistic outliers. It would be good to know more about this design choice. Would also be nice to have more representations of information (e.g. some images, tables etc) and a share of any insight gained from working on this.

Judge 2: 5/7

Clear and modular pipeline making good considerations of hyperparameters, cross validation, and class imbalance. Doesn’t try anything novel but does the basics well and has acceptable metrics

Judge 3: 3/5

A modular system with good engineering and strong performance, but the overall approach sticks to standard tabular ML without pushing creative boundaries.

@gespsy

Judge 1: In my top 14

Pros: Great writeup that is thorough and structured, without feeling formulaic. Design choices are justified. Insight is provided through discussion of less-performant techniques. Appreciate the visuals that help explain correlation in the original data, as well as the optimization approach.

Potential Changes: Would be nice to see more discussion of which features offered insight, beyond the feature importances. Off-the-shelf techniques for this are known to have issues. [2305.10696] Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Judge 2: 5/7

Solid execution but the approach sticks to standard approaches without new insights or deep performance gains

Judge 3: 3/5

Technically solid and methodically sound. However, it sticks to conventional approaches without exploring more innovative angles or deeper behavioral patterns

@mahboobbiswas

Judge 1: Pros: The modeling steps are well-explained. I appreciate the granularity of the optimization. Useful insight into the data is provided, e.g. the cleanliness/usability of the data based on blockchain.

Potential Changes: There are some jarring transitions/inclusions in the writeup that look copy/pasted from the original instructions. Would like to see feature importances cross-validated somehow, since there are known weaknesses in off-the-shelf techniques. While many of the performance metrics look good, nearly 15% of the positive-label predictions are false, which I would view as an issue to address.

Judge 2: 6/7

I like how this reads like a notebook tutorial. Could be a great guide for newcomers and achieves solid performance.

Judge 3: 4/5

Great depth and thoroughness! Covered every stage of the pipeline, from data prep to advanced evaluation, and explored aspects like dynamic thresholding and graph-based features.

@alertcat

Judge 1: In my top 14

Pros: GPU-accelerated approach offers technical novelty and potential optimization. I personally enjoy the esthetics of the visuals. The verbal descriptions of the features (e.g. “network interaction depth”) are clear and helpful.

Potential Changes: Since GPU acceleration is one of the primary focuses, it would be great to understand how it improved performance (either in prediction accuracy or efficiency). Some of the graphics have formatting/detail issues. For instance, one image says it compares feature importance for two different models – but only one of the models is shown in the chart.

Judge 2: 5/7

Cool to see RAPIDS being used. Besides that the approach is fairly standard and well executed.

Judge 3: 4/5

Detailed, structured, and comprehensive write-up that clearly communicates the methodology and findings. Demonstrated great rigor from problem framing to technical implementation and insights.

@aabo0090

Judge 1: In my top 14

Pros:
The model and data cleaning pipeline seem well-engineered. It’s nice to see Logistic Regression, which offers a high degree of interpretability. Really nice to see the “Future Work” being aware of the potential for graph/network data.

Potential Changes: The writeup is a bit vague and bland. Would like to see more detail about key features in the documentation (not just referencing to code). There are unique considerations for discussion in Production Consideration (such as drift detection) – but it would be great to see more detail/evidence for the claims

Judge 2: 4/7

Clear and methodical approach with an interesting ensemble.

Judge 3: 3/5

A solid, production focused work with excellent operational thinking, but lacking quantitative metrics reduces confidence in its effectiveness.

@Limonada

Judge 1: In my top 6

Pros: Writeup is clear and descriptive. Offers insight beyond standard/tutorial level, with design choices well-justified. Experimented with a wide variety of techniques, rather than just jumping on popular ones. Nice engineering/design to incorporate the multiple models into a meta-model.

Potential Changes: With performance this good, overfitting is always a concern. Would be helpful to have sanity checks and/or some degree of interpretability

Judge 2: 4/7

Write-up is light on feature details but the author pays attention to class imbalance, cross validation, and building a good meta model that achieves high performance.

Judge 3: 3/5

Great but standard implementation. I liked the consideration given to handling class imbalance, but the work lacked deeper analysis or exploration of innovative features

@davidgasquez

Judge 1: In my top 3

Pros: Very strong writeup; appreciate both the conversational tone and the resources list. It’s great to see well-researched design, with acknowledgment of prior work in the field. I really appreciate the willingness to experiment with new approaches, and the completely-appropriate-for-sybils graph-based approach of node2vec in particular. The graphics are wonderful. Thanks for the explanation of the exploratory process, incorporating revisions and offering insight rather than just giving a final product.

Potential Changes: Not so much changes to existing work as excited discussion of potential futures. Would be awesome to see some of the ideas towards the end of the post explored more in the future. Since the model performance is so strong, it would be interesting to see what the misclassifications are – thinking in terms of a potential adversarial generator.

Judge 2: 6.5/7

Awesome write up and modeling work. Love the inclusion of graph features, especially community-based ones. Crystal clear thought process

Judge 3: 5/5

One of the best submissions. A well-executed approach with great use of graph features and external resources. Great write-up as well

(final post completing reviews on each submission)

@chrisrellama

Judge 1: Pros: Showcases a clear pipeline, with reasonable steps in data cleaning, feature engineering, and modeling. It’s always nice to see an ensemble consisting of multiple different model types.

Potential Changes: Would be nice to see more justification of design choices – any reason why you’re using these particular models, or combining them in this way? It would also be cool to have more insight into the underlying data – what do we now know about sybils that we didn’t before?

Judge 2: 4/7

Solid and standard approach. Wish there were some performance metrics included in the write-up and more sybil specific feature engineering.

Judge 3: 3/5

Great ensemble approach. However, the write-up lacks detail, omits key performance metrics, and sticks to standard techniques without exploring other methods.

@Stakeridoo

judge 1: In my top 3

Pros: Writeup has unique style. Additional SQL-based data sources helped strengthen the overall data set, by incorporating multiple false negatives. Clear explanation of philosophy, with sanity checks based on human-verifiable ground truths like known addresses. Graphics help clarify methodology. Use of network features like pagerank is appropriate and innovative. Issues with performance (like false negative rate) are highlighted and explained in a way that offers insight.

Potential Changes: I understand the graph-theory metrics, but it might be nice to explain them a bit. The feature_importances choice may need justification, since it’s a technique with known issues. While I acknowledge the value in proprietary approaches, this one feels like it could say a bit more without being too loose-lipped.

judge 2: 6.5/7

One of the best approaches tailored to sybil specific behavior. Great to see they pulled in external datasets and applied their subject matter expertise to going way beyond a generic model.

Judge 3: 5/5

Great to see the use of the SQL heuristics, advanced graph metrics, and behavioral features from real world Sybil hunting. Deep domain knowledge

@Oleh_RCL

Judge 1: In my top 3

Pros: Very nice verbal explanation of feature engineering choices. Cool to see some unique modeling and engineering attributes (GraphSage, Isolation Forest, etc).

Potential Changes: Since the setup showed such insight, I was selfishly hoping for data-based insight into model performance.

Judge 2: 6/7

Robust feature engineering geared toward sybils with temporal, and graph/community features making an appearance. Nice to see use of a GNN and wish there was more detail on how this contributed to overall performance.

Judge 3: 4/5

Comprehensive ensemble with thoughtful feature engineering and some advanced ideas explored like GNN embeddings.

@Learn24

Judge 1: Pros: Appreciate explicit statement of philosophical guidelines. Clear process for each step in pipeline.

Potential Changes: While I appreciate simplicity, this model feels under-developed. Not sure if cross-validation beats a simple “always predict Sybil” model? Would be nice to have more insight into model performance, or at least into the underlying data set.

Judge 2: 4/7

Standard and straightforward approach with respectable performance. Doesn’t try anything new or lean into sybil specific behavior recognition.

Judge 3: 3/5

Systematic data aggregation and disciplined cleaning show competence, but the approach remains too conventional with no exploration of deeper behavioral patterns.

@vicade

judge 1: Pros: Appropriate features and model chosen. Writeup is brief. It’s really great to see explicit references to random seeds, to ensure model reproducibility.

Potential Changes: The writeup doesn’t give much detail or insight into what distinguishes the model, and/or what was learned along the way. It would be helpful to include some personal reflection on the process, and some insight that might help future builders.

Judge 2: 3/7

Very short write-up. On one hand, I appreciate brevity but on the other hand I wish more attention was paid to cross validation, addressing class imbalance, and mostly just thoughtful feature engineering. No performance metrics.

Judge 3: 2/5

Sound but standard tabular ML approach lacking any exploration of advanced or creative techniques. Write up lacks details and insights as well

@ash

Judge 1: In my top 14

Pros: Appreciate the inclusion of a link to source code on GitHub. Well-explained design choices with respect to base models in the ensemble. Clever use of data from the other Pond competition. Appreciate the reflections on partial progress and future research.

Potential Changes: Would be nice to see more discussion of model performance, particularly if there are any parts of the data where it’s more or less effective. Would also be nice to see something besides just words: charts, numbers, etc are often helpful to me.

Judge 2: 5/7

Great approach to try to leverage the Etherscan API and external datasets. Thoughtful and interesting ensemble but missing performance metrics. Easy to read.

Judge 3: 4/5

Nice ensemble with practical data checks and plans for external data integration. I liked the approach of having one model per chain which is different from other submissions

@omniacs.dao

Judge 1: In my top 6

Pros: Writeup displays personality, authenticity, and authority. Insightful description of process, including important sanity checks and hygiene to ensure the model is usable. The descriptions of engineered features are clear and informative, giving clear insight into the expected behaviors of sybils.

Potential Changes: It’s always nice to have diverse representations of insight. Some plots or numbers could help make the presentation even more precise and impactful.

Judge 2: 6/7

Excellent write-up and I really appreciate the offerings in the ‘feature buffet’ which demonstrate earned insight into sybil behavior. Missing performance metrics.

Judge 3: 5/5

Great depth and clarity, weaving advanced behavioral analysis, temporal features, and graph-based insights into the model.

@jpegy

Judge 1: Pros: Writeup feels authentic and shows reflection on various approaches. Use of Benford’s Law is a nice potential heuristic – it’s also useful information for others, to know whether or not it applies. Description of process is clear, and methodology seems appropriate.

Potential Changes: It would be nice to know more about the process of training CatBoost, and why exactly it overestimated the number of sybils (and whether or not this can be fixed). Did the model perform better than a constant guesser? etc.
I appreciate the graphic, but: 1. standard feature importances has known flaws, and 2. the inclusion of all features really squashes the figure, reducing interpretability.

Judge 2: 4/7

Does the basics well and then incorporates external dataset for filtering and takes a swing at using Benford’s law. Not clear how that measurably affected model performance. Could demonstrate greater rigor.

Judge 3: 4/5

Creative with novel features like Benford’s law and interaction patterns. The write up lacks a bit of details compared to other great submissions, but was an interesting one

1 Like

Hi judges,

Thanks a lot for the great feedback and for organizing this challenge, it was a really enjoyable experience and a great opportunity to apply some of the techniques I’ve refined over the past years.

Some parts of my approach might seem a bit “black box”, but that’s intentional. Since winning the OpenDataCommunity x Gitcoin hackathon back in 2023, I’ve been continuously improving my tooling. By the end of 2023 I was already detecting Sybil patterns via CEX withdrawals and other subtle markers, well before many others started catching on. That’s why I prefer to keep certain heuristics private, especially in public writeups.

I’d be happy to share more context or discuss specific aspects of the pipeline — especially with the judges — but would prefer to keep those conversations out of the public forum.

Best regards, Benjamin

Dear Judges,

Thank you for providing the detailed review of the preliminary submissions. I truly appreciate the comprehensive feedback shared.

I was wondering if the final prize winners have been announced yet, as the previous email mentioned they would be announced this week.

Thanks,
Rhythm