Write Up for Models Predicting Sybil Scores of Wallets

Hello Modeloors,

Consider this thread your home for sharing all things related to your submissions in the Octant Sybil analysis challenge, where you need to predict Sybil scores of wallets.

Your write-up here is required to be eligible for a prize.

We encourage you to be visual in your submissions: show the weights given to your models, share the Jupyter notebooks or code used in your submissions, point to datasets that may be useful to other participants, and include any other information you deem valuable.

Since write-ups can be posted after submissions close, other participants cannot copy your methodology during the round. You can take cues for your write-up from another competition we held, along with ideas for creating your own model.

The format of submissions is open-ended, and you are free to express yourself the way you like. You can share as much or as little as you like, but you need to write something here to be considered for prizes.

Good luck predictoooors


Sybil Detection Model Writeup: Crypto Pond Competition (Highly Detailed Analysis)

Participant: Rhythm Suthar

Email : rhythmsuthar123@gmail.com

Contact : +91 7046830999

Date: April 19, 2025

1. Introduction

1.1. Problem Statement & Motivation

Sybil attacks represent a persistent and significant threat within decentralized Web3 ecosystems. These attacks involve a single adversary creating and controlling numerous fake identities (wallets) to illegitimately amplify their influence, exploit incentive mechanisms, or disrupt system operations. The consequences are far-reaching, undermining the fairness of token distributions (airdrops), compromising the integrity of decentralized governance votes, distorting grant funding allocations, and potentially enabling other malicious activities like Distributed Denial of Service (DDoS) attacks or market manipulation. Effectively detecting and mitigating Sybil attacks is therefore crucial for the security, equity, and long-term viability of blockchain protocols and applications. This project addresses this challenge directly by aiming to develop a high-fidelity machine learning model capable of identifying Sybil wallets based on their on-chain behavioral footprint.

1.2. Competition Context

This work was undertaken as part of the “Sybil Detection with Human Passport and Octant” competition hosted on CryptoPond. Sponsored by Human Passport by Holonym, Octant, and the Ethereum Foundation, the competition provided a valuable dataset and a clear objective: leverage historical blockchain data to build a model that assigns a probability score (from 0 for non-Sybil to 1 for Sybil) to potentially suspicious wallet addresses.

1.3. Approach Overview

To achieve the competition’s objective, a rigorous, multi-stage approach was adopted, emphasizing deep data understanding, comprehensive feature engineering, robust modeling techniques, and iterative optimization. The core stages included:

Data Loading, Cleaning, and Consolidation: Ingesting and preparing the provided multi-chain datasets.

Exploratory Data Analysis (EDA): Performing extensive analysis to identify differentiating characteristics between labeled Sybil and Non-Sybil accounts.

Feature Engineering: Constructing a wide array of quantitative features capturing various dimensions of on-chain behavior.

Modeling: Selecting, training, and evaluating powerful Gradient Boosting Machine (GBM) models suitable for the tabular feature set.

Optimization & Ensembling: Tuning model hyperparameters and combining multiple models to maximize predictive accuracy and robustness.
This report details each stage and the findings therein.

2. Data Description and Preparation

2.1. Data Sources

The analysis utilized a rich dataset provided by the competition organizers, spanning wallet activities on both the Base and Ethereum networks. The key components were supplied as Parquet files:

Labeled Training Data (train_addresses.parquet): Provided the ground truth for model training. Contained wallet addresses (ADDRESS) and their corresponding binary labels (LABEL, where 1 indicates Sybil and 0 indicates Non-Sybil). The competition description noted labels were aggregated from diverse sources including Gitcoin Passport stamps, LayerZero Sybil reports, zkSync Sybil reports, Optimism (OP) Sybil reports, Octant contributions, and internal Gitcoin ban lists.

Unlabeled Test Data (test_addresses.parquet): Contained the target wallet addresses for which final Sybil probability predictions were required. This set included approximately 9.8k unique addresses.

Transaction Data (transactions.parquet): Contained detailed records for individual blockchain transactions initiated by (FROM_ADDRESS) or sent to (TO_ADDRESS) relevant wallets. Key fields included BLOCK_NUMBER, BLOCK_TIMESTAMP, TX_HASH, VALUE (ETH), TX_FEE, GAS_PRICE, GAS_USED, GAS_LIMIT, INPUT_DATA, STATUS/TX_SUCCEEDED, and chain-specific L1 fee details for Base. Over 1.3 million transaction records were present in the combined dataset.

Token Transfer Data (token_transfers.parquet): Detailed ERC-20 style token movements, linked to transactions. Included BLOCK_TIMESTAMP, TX_HASH, initiating address (ORIGIN_FROM_ADDRESS), token CONTRACT_ADDRESS, effective sender (FROM_ADDRESS), receiver (TO_ADDRESS), token SYMBOL, DECIMALS, transfer AMOUNT, and estimated AMOUNT_USD. This was the largest dataset, with over 4.5 million combined records.

DEX Swap Data (dex_swaps.parquet): Recorded swaps on decentralized exchanges linked to user transactions. Included BLOCK_TIMESTAMP, TX_HASH, initiating address (ORIGIN_FROM_ADDRESS), DEX CONTRACT_ADDRESS, POOL_NAME, PLATFORM identifier, tokens involved (TOKEN_IN, TOKEN_OUT, SYMBOL_IN, SYMBOL_OUT), amounts (AMOUNT_IN, AMOUNT_OUT), and estimated USD values (AMOUNT_IN_USD, AMOUNT_OUT_USD). The combined dataset contained approximately 685k swap records.

2.2. Data Preparation Steps

Several preparation steps were necessary before analysis and feature engineering:

Data Consolidation: For each activity type (transactions, transfers, swaps), the data from the Base and Ethereum files were concatenated into a single pandas DataFrame. A chain column ('base' or 'ethereum') was added to these activity tables during this process to preserve network origin.

Type Conversion: Columns intended for numerical analysis but loaded as objects (e.g., BLOCK_NUMBER, NONCE, GAS_USED, DECIMALS) were converted to appropriate numeric types using pd.to_numeric(errors='coerce'). Timestamp columns (BLOCK_TIMESTAMP) were converted to datetime objects using pd.to_datetime(). Boolean-like columns (TX_SUCCEEDED) were mapped to integers (1/0).

Training Label Cleaning: The raw training address file contained duplicate addresses. An analysis confirmed that labels for these duplicates were consistent. Therefore, duplicates were removed using drop_duplicates(subset=['ADDRESS'], keep='first'), resulting in a final training set of 99,067 unique addresses. The LABEL column was also converted to integer type.

Imbalance Assessment: The cleaned training data exhibited significant class imbalance, with Sybils (LABEL=1) constituting only ~2.5% of the unique addresses. This imbalance necessitated specific handling strategies during modeling (e.g., scale_pos_weight).
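
For concreteness, a minimal sketch of these preparation steps, assuming per-chain Parquet files and the column names described in Section 2.1 (the file names here are illustrative, not the exact competition files):

import pandas as pd

def load_activity(base_path, eth_path):
    # Concatenate the Base and Ethereum files, tagging each row with its chain.
    base = pd.read_parquet(base_path).assign(chain="base")
    eth = pd.read_parquet(eth_path).assign(chain="ethereum")
    return pd.concat([base, eth], ignore_index=True)

transactions = load_activity("transactions_base.parquet", "transactions_eth.parquet")

# Coerce object columns to numeric types and parse timestamps.
for col in ["BLOCK_NUMBER", "NONCE", "GAS_USED"]:
    transactions[col] = pd.to_numeric(transactions[col], errors="coerce")
transactions["BLOCK_TIMESTAMP"] = pd.to_datetime(transactions["BLOCK_TIMESTAMP"])
# Adjust the mapping if the flag is stored as strings rather than booleans.
transactions["TX_SUCCEEDED"] = transactions["TX_SUCCEEDED"].map({True: 1, False: 0})

# Training labels: duplicates were verified consistent, so keep the first.
train = pd.read_parquet("train_addresses.parquet")
train = train.drop_duplicates(subset=["ADDRESS"], keep="first")
train["LABEL"] = train["LABEL"].astype(int)

# Imbalance ratio, later used as scale_pos_weight (~38.2 on this data).
neg, pos = (train["LABEL"] == 0).sum(), (train["LABEL"] == 1).sum()
scale_pos_weight = neg / pos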

3. Exploratory Data Analysis (EDA) - Detailed Findings

A comprehensive EDA was performed to deeply understand the behavioral differences between the labeled Sybil and Non-Sybil populations. This involved visualizing feature distributions and comparing statistical properties, leading to critical insights that guided feature engineering and modeling.

3.1. Activity Levels & Value

Finding: Sybil accounts consistently demonstrated significantly higher on-chain activity volume and value movement compared to Non-Sybil accounts.

Evidence:

Box plots comparing the log-transformed counts (log(count + 1)) for outgoing transactions (tx_out_count), incoming transactions (tx_in_count), token transfers (tt_count), and DEX swaps (ds_count) clearly showed higher medians, interquartile ranges (IQRs), and upper whiskers for the Sybil group (Label 1).

Bar plots comparing the mean counts further emphasized this disparity; for example, the mean tx_out_count for Sybils was nearly 6 times higher than for Non-Sybils (~17.9 vs. ~3.0). Similar large differences were observed for tt_count (~13.2 vs. ~3.9) and ds_count (~5.2 vs. ~2.2).

Log-transformed box plots for value summation features (tx_out_value_sum, tx_in_value_sum, tt_amount_usd_sum, ds_amount_in_usd_sum) showed distributions heavily shifted towards higher values for Sybils. Non-Sybil value distributions were tightly clustered near zero (log(1)).

Interpretation: This suggests Sybil strategies often involve a high frequency of actions and/or moving larger amounts of assets, potentially related to farming, multi-account coordination, or attempts to meet activity thresholds.
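
As an illustration, the log-transformed comparisons can be reproduced in a few lines of seaborn, assuming a per-address feature table named features with a LABEL column (a sketch, not the exact plotting code used):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Log-transformed box plots of activity counts by label.
count_cols = ["tx_out_count", "tx_in_count", "tt_count", "ds_count"]
fig, axes = plt.subplots(1, len(count_cols), figsize=(16, 4), sharey=True)
for ax, col in zip(axes, count_cols):
    sns.boxplot(x=features["LABEL"], y=np.log1p(features[col].fillna(0)), ax=ax)
    ax.set_title(col)
    ax.set_ylabel("log(count + 1)")
plt.tight_layout()
plt.show()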


3.2. Network Interaction Breadth

Finding: Sybil accounts interacted with a significantly wider and more diverse network of counterparties and smart contracts.

Evidence:

Log-transformed box plots and mean comparison bar plots showed Sybils had considerably higher counts for unique destination addresses (tx_out_unique_to_addr_count) and unique source addresses (tx_in_unique_from_addr_count). The mean number of unique outgoing destinations for Sybils (~10.6) was over 8 times higher than for Non-Sybils (~1.3).

Similar significant differences were observed for unique token contracts interacted with (tt_unique_contracts), unique token symbols transferred (tt_unique_symbols), unique DEX platforms used (ds_unique_platforms), and unique DEX pools interacted with (ds_unique_pools).

Interpretation: This pattern might indicate Sybil operators managing interactions across a large set of controlled addresses, participating in numerous different protocols or token ecosystems simultaneously (e.g., airdrop hunting across many projects), or using intermediary contracts/addresses, leading to broader connectivity compared to typical users who might have more focused interaction patterns.

3.3. Temporal Patterns

Finding: The temporal characteristics of Sybil accounts in this dataset were distinct, suggesting longer-term, persistent activity rather than solely ephemeral behavior.

Evidence:

Account Age (account_age_days): Box plots revealed that the median account age for Sybils was significantly higher than for Non-Sybils. The IQR for Sybils was also shifted towards older ages.

Recency (days_since_last_activity): Conversely, Sybils exhibited much more recent activity. The median number of days since the last recorded on-chain action was substantially lower for Sybils, with their distribution tightly clustered near zero, whereas Non-Sybils showed a much wider spread extending to longer periods of inactivity.

Duration (activity_duration_days): Correspondingly, the median duration between the first and last observed activity was significantly longer for Sybil accounts.

Interpretation: This profile suggests that many Sybils in this labeled set might not be simple, short-lived bots created for a single event, but potentially represent established addresses used over extended periods for ongoing activities, perhaps adapting strategies over time. Their recent activity further implies continuous operation.


3.4. Chain Preference

Finding: A very strong pattern emerged regarding the preferred blockchain network for activity.

Evidence: Histograms of the base_tx_ratio (proportion of outgoing transactions on Base) showed two distinct peaks at 0 (all Ethereum) and 1 (all Base). For Non-Sybil accounts (Label 0), both peaks were substantial, although the peak at 0 (Ethereum) was larger. For Sybil accounts (Label 1), however, the distribution was overwhelmingly dominated by a massive peak at 1 (Base), with very few Sybils showing primarily Ethereum activity or mixed activity.

Interpretation: This indicates a strong tendency for Sybil accounts within this specific dataset and timeframe to focus their activities on the Base network. This could be due to lower fees, specific incentive programs, or particular protocols targeted on Base during the period covered by the data. This feature appears highly discriminative.
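
A minimal sketch of this comparison, reusing the features table assumed above:

import matplotlib.pyplot as plt

# Histograms of base_tx_ratio per label: Non-Sybils peak at both 0 and 1,
# while Sybils pile up almost entirely at 1 (Base-dominant activity).
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, label in zip(axes, [0, 1]):
    subset = features.loc[features["LABEL"] == label, "base_tx_ratio"].dropna()
    ax.hist(subset, bins=20)
    ax.set_title(f"Label {label}")
    ax.set_xlabel("base_tx_ratio")
plt.tight_layout()
plt.show()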


3.5. Transaction Costs/Efficiency

Finding: While Sybils incurred higher total fees/gas due to higher volume, their average cost per transaction showed subtle differences.

Evidence: Log-scaled box plots for the mean outgoing transaction fee (tx_out_fee_mean) and mean gas used (tx_out_gas_used_mean) showed slightly lower median values for Sybil accounts compared to Non-Sybils.

Interpretation: This could tentatively suggest that Sybil transactions, on average, might be computationally simpler (e.g., basic transfers vs. complex DeFi interactions requiring more gas) or that Sybil operators employ strategies (potentially automated) to optimize for lower gas prices or fees more consistently than average users.

3.6. Feature Correlations

Finding: Several groups of engineered features exhibited high positive correlations.

Evidence: The correlation heatmap revealed strong positive correlations (>0.8) between:

Mean and median for value-based features (e.g., tx_out_value_mean / median).

Various count metrics across different activity types (e.g., ds_count / tt_count).

Different uniqueness metrics (e.g., ds_unique_pools / tt_unique_contracts).

Temporal features like account_age_days and activity_duration_days.

Interpretation: High correlations indicate some potential redundancy between features. For instance, mean and median value features capture very similar information. While tree-based models can handle multicollinearity, this information could be used for feature selection if model simplification or further optimization were needed.


3.7. Missing Values

Finding: The presence and pattern of missing values (NaNs) in the aggregated features were strongly correlated with the Sybil label.

Evidence:

Calculation of missing percentages showed high rates for features derived from specific activities (especially DEX swaps, followed by token transfers and transactions), primarily indicating a lack of that activity type for an address.

The missingno matrix plot, sorted by label, visually demonstrated that Non-Sybil accounts (Label 0, bottom part of the plot) had significantly more missing data across nearly all feature categories compared to Sybil accounts (Label 1, top part). Sybils were much more likely to have some activity recorded across transactions, transfers, and swaps.

Interpretation: This crucial insight suggests that a lack of broad on-chain engagement (resulting in NaNs for many aggregated features) is itself a characteristic distinguishing Non-Sybils from the more broadly active Sybils in this dataset. This implies that how NaNs are handled during modeling is important; they contain predictive information.
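
The matrix itself is one call with the missingno library, again assuming the features table above:

import missingno as msno

# Missingness matrix sorted by label: Non-Sybils (bottom rows) show far
# more gaps than Sybils (top rows).
msno.matrix(features.sort_values("LABEL").drop(columns=["LABEL"]))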


4. Feature Engineering

Guided by the EDA, a comprehensive feature engineering process was undertaken to transform the raw, time-series activity data into a static feature vector for each address suitable for machine learning models. Approximately 55 numerical features were constructed.

4.1. Feature Categories

Basic Aggregates: These captured the overall volume and central tendency of core activities. Included metrics like tx_out_count, tx_in_count, tt_count, ds_count, and statistical summaries (sum, mean, median, std) for numerical fields like VALUE (ETH), TX_FEE, GAS_USED, GAS_PRICE, AMOUNT_USD, AMOUNT_IN_USD, AMOUNT_OUT_USD. Aggregations were performed separately based on the address acting as the source (e.g., tx_out_* from FROM_ADDRESS) or destination (e.g., tx_in_* from TO_ADDRESS) where applicable.

Uniqueness Counts: Quantified the breadth of an address’s interactions using nunique() aggregations. Examples include tx_out_unique_to_addr_count, tx_in_unique_from_addr_count, tt_unique_contracts, tt_unique_symbols, ds_unique_platforms, ds_unique_pools, ds_unique_tokens_in, ds_unique_tokens_out. These aimed to measure network connectivity and diversity of protocol/token usage.

Temporal Features: Captured the lifecycle and timing patterns of account activity. This involved calculating the overall account_age_days (time since first observed activity), days_since_last_activity (time since last observed activity), and activity_duration_days (time between first and last activity) by combining min/max timestamps across all activity types. Similar duration and recency metrics were also calculated specifically for token transfers (days_since_last_tt, tt_activity_duration_days) and DEX swaps (days_since_last_ds, ds_activity_duration_days).

Chain Preference: Based on the chain column added to the raw transaction data, features like base_tx_ratio and ethereum_tx_ratio were calculated as the proportion of an address’s outgoing transactions occurring on each respective chain.

Activity Ratios: Ratios were engineered to capture relative behaviors and potentially normalize for overall activity level. Examples include tx_val_out_in_ratio, tx_count_out_in_ratio, tx_unique_out_addr_ratio, tx_unique_in_addr_ratio, tx_failed_ratio, ds_tt_count_ratio (swap vs transfer frequency), ds_tt_usd_sum_ratio (swap vs transfer value), and activity_ratio (active duration relative to total age).

Specific Counts: Counts of transfers involving key ecosystem tokens (tt_weth_count, tt_usdc_count, tt_usdt_count) were included as potentially indicative features.
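
A condensed sketch of how the first few categories can be built with pandas groupby aggregations (names reuse the merged transactions frame from Section 2.2; the full pipeline repeats this across all activity tables):

tx_out = transactions.groupby("FROM_ADDRESS").agg(
    tx_out_count=("TX_HASH", "count"),
    tx_out_value_sum=("VALUE", "sum"),
    tx_out_value_mean=("VALUE", "mean"),
    tx_out_value_median=("VALUE", "median"),
    tx_out_fee_mean=("TX_FEE", "mean"),
    tx_out_gas_used_mean=("GAS_USED", "mean"),
    tx_out_unique_to_addr_count=("TO_ADDRESS", "nunique"),
)

# Chain preference: share of outgoing transactions on Base.
tx_out["base_tx_ratio"] = (
    transactions["chain"].eq("base").groupby(transactions["FROM_ADDRESS"]).mean()
)

# Temporal features from min/max timestamps (shown for transactions only;
# the real features combine timestamps across all activity types).
ts = transactions.groupby("FROM_ADDRESS")["BLOCK_TIMESTAMP"].agg(["min", "max"])
snapshot = transactions["BLOCK_TIMESTAMP"].max()
tx_out["account_age_days"] = (snapshot - ts["min"]).dt.days
tx_out["days_since_last_activity"] = (snapshot - ts["max"]).dt.days
tx_out["activity_duration_days"] = (ts["max"] - ts["min"]).dt.days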

4.2. Final Preparation

After merging all engineered features, intermediate timestamp columns used only for temporal calculations were dropped. A final preparation step addressed potential numerical issues before modeling:

Inf Handling: Replaced any infinity values (potentially resulting from ratio calculations with near-zero denominators) with NaN.

Clipping: Clipped extremely large positive or negative finite values to roughly one tenth of the float32 limits, leaving a safety margin to prevent overflow issues in XGBoost.

NaN Imputation: Filled all remaining NaN values (primarily resulting from addresses lacking specific types of activity, or missing USD price data) with the distinct numerical value -1. This strategy allows tree-based models to potentially learn from the pattern of missingness itself, treating it differently from a genuine zero value.
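
These three steps reduce to a few pandas operations (a sketch, using the features table assumed earlier):

import numpy as np

# inf -> NaN, clip to a safe float32 range, then fill NaNs with the
# sentinel -1 so trees can learn from missingness directly.
SAFE = np.finfo(np.float32).max / 10

features = features.replace([np.inf, -np.inf], np.nan)
num_cols = features.select_dtypes(include=[np.number]).columns
features[num_cols] = features[num_cols].clip(-SAFE, SAFE).fillna(-1)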

5. Modeling Approach

A robust modeling strategy was employed, focusing on powerful gradient boosting algorithms and best practices for validation and ensembling.

Validation Strategy: Stratified 5-Fold Cross-Validation served as the cornerstone for model evaluation and generating out-of-fold (OOF) predictions. Stratification by the LABEL column ensured that the severe class imbalance (~2.5% Sybil) was preserved in each train/validation split, leading to more reliable AUC estimates and preventing folds from having zero or very few Sybil examples. Area Under the ROC Curve (AUC) was used as the primary optimization and evaluation metric, as it effectively measures a model’s ability to rank positive instances higher than negative ones, which is suitable for imbalanced classification.

Handling Imbalance: The scale_pos_weight parameter, available in LightGBM, XGBoost, and CatBoost, was used to counteract the class imbalance. It was set to the ratio of negative class count to positive class count (~38.2), effectively increasing the weight (importance) of correctly classifying the minority Sybil class during model training.
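
A minimal sketch of this validation loop, shown with LightGBM; the names X, y, and X_test are assumed (a pandas feature matrix, label Series, and test matrix), not taken from the competition code:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

neg, pos = (y == 0).sum(), (y == 1).sum()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y))
test_pred = np.zeros(len(X_test))

for train_idx, val_idx in skf.split(X, y):
    model = lgb.LGBMClassifier(scale_pos_weight=neg / pos)  # ~38.2 here
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # Out-of-fold predictions for unbiased AUC; test predictions averaged.
    oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / skf.n_splits

print("OOF AUC:", roc_auc_score(y, oof))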

Models: Three state-of-the-art Gradient Boosting Machine (GBM) implementations were selected for their high performance on tabular data:

LightGBM: Chosen for its speed, efficiency, and excellent predictive power. Hyperparameters were rigorously tuned using the Optuna library, performing a search over 42 trials where each trial involved a full 5-fold CV evaluation optimizing for mean AUC (a sketch of this search follows the model list below).

XGBoost: A widely adopted and powerful GBM library. It was trained using a competitive default parameter set, providing model diversity.

CatBoost: Known for its robustness and unique handling of categorical features (though less critical here as features were numeric). Trained with competitive default parameters to add further diversity to the ensemble.

GPU Acceleration: Training for all three models was performed utilizing GPU acceleration (device='gpu' in LightGBM, device='cuda' in XGBoost, task_type='GPU' in CatBoost) to significantly reduce computation time on the large feature set.
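
The LightGBM tuning described above can be expressed as an Optuna study; this reuses skf, neg, and pos from the CV sketch earlier, and the search space shown is illustrative rather than the exact space used:

import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "scale_pos_weight": neg / pos,
    }
    # Each trial is scored by a full stratified 5-fold CV on mean AUC.
    model = lgb.LGBMClassifier(**params)
    return cross_val_score(model, X, y, cv=skf, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=42)
print(study.best_value, study.best_params)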

Ensembling: A Weighted Average Ensemble was constructed as the final predictive model. The predictions from the tuned LightGBM, XGBoost, and CatBoost models (generated via their respective 5-fold CV processes on the test set) were averaged together. The weights assigned to each model were proportional to their individual OOF AUC scores relative to a baseline AUC of 0.5 (weight_i = (AUC_i - 0.5) / sum(AUC_j - 0.5)). This approach gives slightly more influence to models that demonstrated better OOF performance.
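
In code, the weighting reduces to a couple of lines (auc_* and pred_* are assumed names for the three models' OOF AUCs and test-set predictions):

import numpy as np

aucs = np.array([auc_lgbm, auc_xgb, auc_cat])  # individual OOF AUCs
weights = (aucs - 0.5) / (aucs - 0.5).sum()    # weight_i = (AUC_i - 0.5) / sum_j (AUC_j - 0.5)
final_pred = weights[0] * pred_lgbm + weights[1] * pred_xgb + weights[2] * pred_cat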

6. Results

The comprehensive modeling pipeline yielded exceptionally high performance, validating the effectiveness of the engineered features and chosen algorithms.

Individual Model Performance (OOF AUC): Each of the three GBMs achieved outstanding OOF AUC scores on the full feature set, demonstrating strong individual predictive capabilities:

Tuned LightGBM OOF AUC: 0.996945

XGBoost OOF AUC: 0.996921

CatBoost OOF AUC: 0.996928

The remarkable consistency across these diverse GBM implementations underscores the strong signal captured by the engineered features.

Final Ensemble Performance (OOF): The weighted average ensemble (which resulted in near-equal weights due to the very similar individual AUCs) produced the following robust OOF performance:

Weighted Ensemble OOF AUC: 0.997309

This score, marginally higher than the best single-model OOF AUC in this run, represents a highly reliable estimate of generalization performance and benefits from the combined strengths of the three models.

Classification Metrics (at 0.5 probability threshold):

Accuracy: 0.9908 (~99.1%) - High, but influenced by imbalance.

Sybil Recall: 0.97 - The ensemble correctly identified 97% of the true Sybil accounts (missing only 72 out of 2528). This high recall is critical for effective Sybil detection.

Sybil Precision: 0.75 - When the ensemble predicted an account was Sybil, it was correct 75% of the time. The remaining 25% (838 addresses) were False Positives.

Sybil F1-Score: 0.84 - A strong harmonic mean of precision and recall.

Confusion Matrix: The matrix quantified the trade-off: very few missed Sybils (False Negatives = 72) at the expense of a moderate number of misclassified Non-Sybils (False Positives = 838).

Prediction Distribution: Visual analysis confirmed excellent separation between the predicted probabilities for the two classes, with most predictions concentrated very close to 0 or 1.

Key Feature Importances: Analysis of feature importances (primarily from tuned LGBM, averaged across folds) revealed the most influential factors driving the model’s predictions:

Top Tier: Temporal features (account_age_days, days_since_last_tt, days_since_last_activity, tt_activity_duration_days), Transaction Cost/Efficiency (tx_out_gas_used_mean, tx_out_gas_price_mean, tx_out_fee_sum), and key Ratio features (tx_unique_out_addr_ratio, tx_count_out_in_ratio).

Highly Important: Other significant features included value summaries (tx_out_value_median), other cost metrics (tx_out_fee_mean), value ratios (tx_val_out_in_ratio), overall activity duration (activity_duration_days), and chain preference ratios (ethereum_tx_ratio, base_tx_ratio).

This ranking strongly aligns with the EDA findings, confirming that account lifecycle, recency, activity breadth, transaction efficiency, and chain choice were the most powerful predictors in this dataset.
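
A sketch of how such a ranking can be produced, assuming the per-fold LightGBM models were collected in a list fold_models during cross-validation (a hypothetical name, not from the competition code):

import pandas as pd

# Average gain/split importances across the five folds and rank them.
imp = pd.DataFrame(
    {f"fold_{i}": m.feature_importances_ for i, m in enumerate(fold_models)},
    index=X.columns,
)
print(imp.mean(axis=1).sort_values(ascending=False).head(20))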

7. Discussion & Conclusion

This project successfully engineered a highly effective machine learning solution for Sybil detection tailored to the specific dataset and objectives of the CryptoPond competition. The detailed Exploratory Data Analysis was instrumental in identifying key behavioral differentiators, notably the higher activity, broader network interaction, distinct temporal profiles (older, more consistently active Sybils), and strong Base chain preference exhibited by labeled Sybil accounts.

A comprehensive feature set was constructed to quantify these observations. The application of tuned and diverse Gradient Boosting Models (LightGBM, XGBoost, CatBoost) within a robust Stratified K-Fold cross-validation framework yielded outstanding individual model performance, with OOF AUC scores approaching 0.997.

The final 3-model weighted ensemble produced a state-of-the-art OOF AUC of 0.9973. Critically, at a standard 0.5 decision threshold, the ensemble achieved an excellent Sybil recall of 97%, demonstrating its capability to identify the vast majority of malicious actors defined within this dataset. The corresponding Sybil precision of 75% represents a reasonable trade-off, although the optimal balance might be adjusted depending on the specific costs associated with False Positives versus False Negatives in a real-world deployment.

While the model exhibits strong performance on the provided data, certain limitations should be acknowledged. The model’s effectiveness is inherently tied to the quality and representativeness of the initial Sybil labels; different labeling methodologies could yield different results. Furthermore, sophisticated adversaries continuously adapt their strategies (concept drift), potentially requiring model retraining or feature updates over time.

Future work could explore avenues for marginal improvement, such as incorporating external data (e.g., known malicious contract lists, CEX deposit address heuristics), developing complex graph-based features using Graph Neural Networks to explicitly model wallet interactions, or implementing more advanced ensembling techniques like stacking. However, given the near-perfect OOF AUC achieved, the current feature set and ensemble likely capture the bulk of the predictive signal present in this specific dataset.

In conclusion, the developed 3-model weighted ensemble provides a powerful, data-driven solution for this Sybil detection task. The rigorous methodology, combining deep EDA, comprehensive feature engineering, model optimization, and robust ensembling, resulted in a model demonstrating exceptional performance in identifying Sybil behavior based on on-chain activity. The final submission (submission_ensemble_weighted_3model.csv) encapsulates this optimized solution.

Sybil Detection Model Writeup by achankun

Participant: achankun
Email: ichsanbit45@gmail.com

Overview

This writeup describes my approach to building a machine learning model for detecting Sybil wallets in the Ethereum ecosystem. The solution combines feature engineering from blockchain transaction data with a LightGBM classifier to predict the probability of an address being a Sybil wallet.

Data Preparation

The dataset included labeled wallet addresses from both Base and Ethereum chains, along with their transaction histories:

  • Address Data:
    • Training set: 104,016 addresses (combined Base + Ethereum)
    • Test set: 19,584 addresses
  • Transaction Data:
    • Regular transactions (1.4M records)
    • Token transfers (4.5M records)
    • DEX swaps (685k records)

Key preprocessing steps:

  1. Combined Base and Ethereum chain data
  2. Standardized column names across datasets
  3. Converted numeric columns to appropriate types
  4. Validated the target variable (is_sybil) to ensure binary labels (0/1)

Feature Engineering

I created three categories of features for each wallet address:

1. Transaction Features

  • Count, sum, mean, std, and median of transaction values
  • Gas price statistics (mean, std, median)
  • Gas used statistics (mean, std, median)
  • Number of unique blocks interacted with

2. Token Transfer Features

  • Count, sum, and distribution statistics of token amounts
  • USD value statistics (when available)
  • Number of unique recipient addresses

3. DEX Swap Features

  • Count and amount statistics for swap inputs/outputs
  • USD value statistics for swaps
  • Number of unique tokens swapped

All features were merged by wallet address, with missing values filled as 0 (assuming no activity in that category).
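
A sketch of that merge, with illustrative frame names (addresses holds the wallet list; the three feature frames correspond to the categories above):

features = (
    addresses
    .merge(tx_features, on="address", how="left")
    .merge(token_features, on="address", how="left")
    .merge(dex_features, on="address", how="left")
    .fillna(0)  # missing value = no activity of that type, per the assumption above
)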

Model Selection

I chose LightGBM for several reasons:

  • Handles tabular data effectively
  • Robust to feature scales and types
  • Efficient with large datasets
  • Built-in handling of class imbalance

Model Configuration:

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,        # upper bound on boosting rounds (early stopping trims this)
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,            # row subsampling per tree
    colsample_bytree=0.8,     # feature subsampling per tree
    random_state=42,
    class_weight='balanced',  # built-in handling of the class imbalance
    metric='auc'
)

Training Approach:

  • 80/20 stratified split for validation
  • Early stopping after 50 rounds without improvement
  • AUC as the evaluation metric
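
Putting the configuration and training approach together (X and y are assumed names for the merged feature matrix and is_sybil labels, and model is the classifier configured above):

from sklearn.model_selection import train_test_split
import lightgbm as lgb

# 80/20 stratified split, then fit with early stopping on validation AUC.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop after 50 stale rounds
)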

Results

The model achieved a validation AUC of 0.923, demonstrating strong ability to distinguish between Sybil and legitimate wallets.

Top 10 Features by Importance:

  1. tx_value_count (number of transactions)
  2. token_amount_precise_count (number of token transfers)
  3. tx_gas_used_mean (average gas used)
  4. dex_amount_in_count (number of DEX swaps)
  5. token_to_address_nunique (unique recipients)
  6. tx_value_sum (total ETH transacted)
  7. tx_block_number_nunique (unique blocks)
  8. token_amount_usd_sum (total USD value)
  9. tx_gas_price_mean (average gas price)
  10. dex_amount_out_sum (total tokens received)

Key Insights

  1. Activity Patterns Matter: The count of transactions and token transfers were most predictive
  2. Economic Signals: Total value transacted and gas usage provided strong signals
  3. DEX Behavior: Swap activity was particularly indicative of Sybil behavior
  4. Network Diversity: Wallets interacting with more unique addresses/blocks were less likely to be Sybils

Conclusion

This solution demonstrates that Sybil detection can be effectively automated using transaction pattern analysis. The LightGBM model successfully learned meaningful patterns from wallet activity data while handling the inherent class imbalance.

Future improvements could include:

  • Graph-based features capturing wallet connections
  • Time-series analysis of transaction patterns
  • Ensemble approaches combining multiple models