Write Up for Models Predicting Sybil Scores of Wallets

Hello Modeloors,

Consider this thread your home for sharing all things related to your submissions in the Octant Sybil analysis challenge, where you need to predict Sybil scores of wallets.

Your write-up here is required to be eligible for a prize.

We encourage you to be visual in your submissions: show the weights given to your models, share the Jupyter notebooks or code used in your submissions, point to datasets that may be useful to other participants, and include any other information you deem valuable.

Since write-ups can be posted after submissions close, other participants cannot copy your methodology during the round. You can take cues for your write-up from another competition we held, along with ideas for creating your own model.

The format of submissions is open-ended, and you are free to express yourself the way you like. You can share as much or as little as you like, but you need to write something here to be considered for prizes.

Good luck predictoooors


Sybil Detection Model Writeup: Crypto Pond Competition (Highly Detailed Analysis)

Participant: Rhythm Suthar

Email : rhythmsuthar123@gmail.com

Contact : +91 7046830999

Date: April 19, 2025

1. Introduction

1.1. Problem Statement & Motivation

Sybil attacks represent a persistent and significant threat within decentralized Web3 ecosystems. These attacks involve a single adversary creating and controlling numerous fake identities (wallets) to illegitimately amplify their influence, exploit incentive mechanisms, or disrupt system operations. The consequences are far-reaching, undermining the fairness of token distributions (airdrops), compromising the integrity of decentralized governance votes, distorting grant funding allocations, and potentially enabling other malicious activities like Distributed Denial of Service (DDoS) attacks or market manipulation. Effectively detecting and mitigating Sybil attacks is therefore crucial for the security, equity, and long-term viability of blockchain protocols and applications. This project addresses this challenge directly by aiming to develop a high-fidelity machine learning model capable of identifying Sybil wallets based on their on-chain behavioral footprint.

1.2. Competition Context

This work was undertaken as part of the “Sybil Detection with Human Passport and Octant” competition hosted on CryptoPond. Sponsored by Human Passport by Holonym, Octant, and the Ethereum Foundation, the competition provided a valuable dataset and a clear objective: leverage historical blockchain data to build a model that assigns a probability score (from 0 for non-Sybil to 1 for Sybil) to potentially suspicious wallet addresses.

1.3. Approach Overview

To achieve the competition’s objective, a rigorous, multi-stage approach was adopted, emphasizing deep data understanding, comprehensive feature engineering, robust modeling techniques, and iterative optimization. The core stages included:

Data Loading, Cleaning, and Consolidation: Ingesting and preparing the provided multi-chain datasets.

Exploratory Data Analysis (EDA): Performing extensive analysis to identify differentiating characteristics between labeled Sybil and Non-Sybil accounts.

Feature Engineering: Constructing a wide array of quantitative features capturing various dimensions of on-chain behavior.

Modeling: Selecting, training, and evaluating powerful Gradient Boosting Machine (GBM) models suitable for the tabular feature set.

Optimization & Ensembling: Tuning model hyperparameters and combining multiple models to maximize predictive accuracy and robustness.
This report details each stage and the findings therein.

2. Data Description and Preparation

2.1. Data Sources

The analysis utilized a rich dataset provided by the competition organizers, spanning wallet activities on both the Base and Ethereum networks. The key components were supplied as Parquet files:

Labeled Training Data (train_addresses.parquet): Provided the ground truth for model training. Contained wallet addresses (ADDRESS) and their corresponding binary labels (LABEL, where 1 indicates Sybil and 0 indicates Non-Sybil). The competition description noted labels were aggregated from diverse sources including Gitcoin Passport stamps, LayerZero Sybil reports, zkSync Sybil reports, Optimism (OP) Sybil reports, Octant contributions, and internal Gitcoin ban lists.

Unlabeled Test Data (test_addresses.parquet): Contained the target wallet addresses for which final Sybil probability predictions were required. This set included approximately 9.8k unique addresses.

Transaction Data (transactions.parquet): Contained detailed records for individual blockchain transactions initiated by (FROM_ADDRESS) or sent to (TO_ADDRESS) relevant wallets. Key fields included BLOCK_NUMBER, BLOCK_TIMESTAMP, TX_HASH, VALUE (ETH), TX_FEE, GAS_PRICE, GAS_USED, GAS_LIMIT, INPUT_DATA, STATUS/TX_SUCCEEDED, and chain-specific L1 fee details for Base. Over 1.3 million transaction records were present in the combined dataset.

Token Transfer Data (token_transfers.parquet): Detailed ERC-20 style token movements, linked to transactions. Included BLOCK_TIMESTAMP, TX_HASH, initiating address (ORIGIN_FROM_ADDRESS), token CONTRACT_ADDRESS, effective sender (FROM_ADDRESS), receiver (TO_ADDRESS), token SYMBOL, DECIMALS, transfer AMOUNT, and estimated AMOUNT_USD. This was the largest dataset, with over 4.5 million combined records.

DEX Swap Data (dex_swaps.parquet): Recorded swaps on decentralized exchanges linked to user transactions. Included BLOCK_TIMESTAMP, TX_HASH, initiating address (ORIGIN_FROM_ADDRESS), DEX CONTRACT_ADDRESS, POOL_NAME, PLATFORM identifier, tokens involved (TOKEN_IN, TOKEN_OUT, SYMBOL_IN, SYMBOL_OUT), amounts (AMOUNT_IN, AMOUNT_OUT), and estimated USD values (AMOUNT_IN_USD, AMOUNT_OUT_USD). The combined dataset contained approximately 685k swap records.

2.2. Data Preparation Steps

Several preparation steps were necessary before analysis and feature engineering:

Data Consolidation: For each activity type (transactions, transfers, swaps), the data from the Base and Ethereum files were concatenated into a single pandas DataFrame. A chain column ('base' or 'ethereum') was added to these activity tables during this process to preserve network origin.

Type Conversion: Columns intended for numerical analysis but loaded as objects (e.g., BLOCK_NUMBER, NONCE, GAS_USED, DECIMALS) were converted to appropriate numeric types using pd.to_numeric(errors='coerce'). Timestamp columns (BLOCK_TIMESTAMP) were converted to datetime objects using pd.to_datetime(). Boolean-like columns (TX_SUCCEEDED) were mapped to integers (1/0).

Training Label Cleaning: The raw training address file contained duplicate addresses. An analysis confirmed that labels for these duplicates were consistent. Therefore, duplicates were removed using drop_duplicates(subset=['ADDRESS'], keep='first'), resulting in a final training set of 99,067 unique addresses. The LABEL column was also converted to integer type.

Imbalance Assessment: The cleaned training data exhibited significant class imbalance, with Sybils (LABEL=1) constituting only ~2.5% of the unique addresses. This imbalance necessitated specific handling strategies during modeling (e.g., scale_pos_weight).
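
For concreteness, a minimal sketch of these preparation steps, assuming per-chain Parquet files and the column names described in Section 2.1 (the file names here are illustrative, not the exact competition files):

import pandas as pd

def load_activity(base_path, eth_path):
    # Concatenate the Base and Ethereum files, tagging each row with its chain.
    base = pd.read_parquet(base_path).assign(chain="base")
    eth = pd.read_parquet(eth_path).assign(chain="ethereum")
    return pd.concat([base, eth], ignore_index=True)

transactions = load_activity("transactions_base.parquet", "transactions_eth.parquet")

# Coerce object columns to numeric types and parse timestamps.
for col in ["BLOCK_NUMBER", "NONCE", "GAS_USED"]:
    transactions[col] = pd.to_numeric(transactions[col], errors="coerce")
transactions["BLOCK_TIMESTAMP"] = pd.to_datetime(transactions["BLOCK_TIMESTAMP"])
# Adjust the mapping if the flag is stored as strings rather than booleans.
transactions["TX_SUCCEEDED"] = transactions["TX_SUCCEEDED"].map({True: 1, False: 0})

# Training labels: duplicates were verified consistent, so keep the first.
train = pd.read_parquet("train_addresses.parquet")
train = train.drop_duplicates(subset=["ADDRESS"], keep="first")
train["LABEL"] = train["LABEL"].astype(int)

# Imbalance ratio, later used as scale_pos_weight (~38.2 on this data).
neg, pos = (train["LABEL"] == 0).sum(), (train["LABEL"] == 1).sum()
scale_pos_weight = neg / pos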

3. Exploratory Data Analysis (EDA) - Detailed Findings

A comprehensive EDA was performed to deeply understand the behavioral differences between the labeled Sybil and Non-Sybil populations. This involved visualizing feature distributions and comparing statistical properties, leading to critical insights that guided feature engineering and modeling.

3.1. Activity Levels & Value

Finding: Sybil accounts consistently demonstrated significantly higher on-chain activity volume and value movement compared to Non-Sybil accounts.

Evidence:

Box plots comparing the log-transformed counts (log(count + 1)) for outgoing transactions (tx_out_count), incoming transactions (tx_in_count), token transfers (tt_count), and DEX swaps (ds_count) clearly showed higher medians, interquartile ranges (IQRs), and upper whiskers for the Sybil group (Label 1).

Bar plots comparing the mean counts further emphasized this disparity; for example, the mean tx_out_count for Sybils was nearly 6 times higher than for Non-Sybils (~17.9 vs. ~3.0). Similar large differences were observed for tt_count (~13.2 vs. ~3.9) and ds_count (~5.2 vs. ~2.2).

Log-transformed box plots for value summation features (tx_out_value_sum, tx_in_value_sum, tt_amount_usd_sum, ds_amount_in_usd_sum) showed distributions heavily shifted towards higher values for Sybils. Non-Sybil value distributions were tightly clustered near zero (log(1)).

Interpretation: This suggests Sybil strategies often involve a high frequency of actions and/or moving larger amounts of assets, potentially related to farming, multi-account coordination, or attempts to meet activity thresholds.
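
As an illustration, the log-transformed comparisons can be reproduced in a few lines of seaborn, assuming a per-address feature table named features with a LABEL column (a sketch, not the exact plotting code used):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Log-transformed box plots of activity counts by label.
count_cols = ["tx_out_count", "tx_in_count", "tt_count", "ds_count"]
fig, axes = plt.subplots(1, len(count_cols), figsize=(16, 4), sharey=True)
for ax, col in zip(axes, count_cols):
    sns.boxplot(x=features["LABEL"], y=np.log1p(features[col].fillna(0)), ax=ax)
    ax.set_title(col)
    ax.set_ylabel("log(count + 1)")
plt.tight_layout()
plt.show()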


3.2. Network Interaction Breadth

Finding: Sybil accounts interacted with a significantly wider and more diverse network of counterparties and smart contracts.

Evidence:

Log-transformed box plots and mean comparison bar plots showed Sybils had considerably higher counts for unique destination addresses (tx_out_unique_to_addr_count) and unique source addresses (tx_in_unique_from_addr_count). The mean number of unique outgoing destinations for Sybils (~10.6) was over 8 times higher than for Non-Sybils (~1.3).

Similar significant differences were observed for unique token contracts interacted with (tt_unique_contracts), unique token symbols transferred (tt_unique_symbols), unique DEX platforms used (ds_unique_platforms), and unique DEX pools interacted with (ds_unique_pools).

Interpretation: This pattern might indicate Sybil operators managing interactions across a large set of controlled addresses, participating in numerous different protocols or token ecosystems simultaneously (e.g., airdrop hunting across many projects), or using intermediary contracts/addresses, leading to broader connectivity compared to typical users who might have more focused interaction patterns.

3.3. Temporal Patterns

Finding: The temporal characteristics of Sybil accounts in this dataset were distinct, suggesting longer-term, persistent activity rather than solely ephemeral behavior.

Evidence:

Account Age (account_age_days): Box plots revealed that the median account age for Sybils was significantly higher than for Non-Sybils. The IQR for Sybils was also shifted towards older ages.

Recency (days_since_last_activity): Conversely, Sybils exhibited much more recent activity. The median number of days since the last recorded on-chain action was substantially lower for Sybils, with their distribution tightly clustered near zero, whereas Non-Sybils showed a much wider spread extending to longer periods of inactivity.

Duration (activity_duration_days): Correspondingly, the median duration between the first and last observed activity was significantly longer for Sybil accounts.

Interpretation: This profile suggests that many Sybils in this labeled set might not be simple, short-lived bots created for a single event, but potentially represent established addresses used over extended periods for ongoing activities, perhaps adapting strategies over time. Their recent activity further implies continuous operation.


3.4. Chain Preference

Finding: A very strong pattern emerged regarding the preferred blockchain network for activity.

Evidence: Histograms of the base_tx_ratio (proportion of outgoing transactions on Base) showed two distinct peaks at 0 (all Ethereum) and 1 (all Base). For Non-Sybil accounts (Label 0), both peaks were substantial, although the peak at 0 (Ethereum) was larger. For Sybil accounts (Label 1), however, the distribution was overwhelmingly dominated by a massive peak at 1 (Base), with very few Sybils showing primarily Ethereum activity or mixed activity.

Interpretation: This indicates a strong tendency for Sybil accounts within this specific dataset and timeframe to focus their activities on the Base network. This could be due to lower fees, specific incentive programs, or particular protocols targeted on Base during the period covered by the data. This feature appears highly discriminative.
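
A minimal sketch of this comparison, reusing the features table assumed above:

import matplotlib.pyplot as plt

# Histograms of base_tx_ratio per label: Non-Sybils peak at both 0 and 1,
# while Sybils pile up almost entirely at 1 (Base-dominant activity).
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, label in zip(axes, [0, 1]):
    subset = features.loc[features["LABEL"] == label, "base_tx_ratio"].dropna()
    ax.hist(subset, bins=20)
    ax.set_title(f"Label {label}")
    ax.set_xlabel("base_tx_ratio")
plt.tight_layout()
plt.show()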


3.5. Transaction Costs/Efficiency

Finding: While Sybils incurred higher total fees/gas due to higher volume, their average cost per transaction showed subtle differences.

Evidence: Log-scaled box plots for the mean outgoing transaction fee (tx_out_fee_mean) and mean gas used (tx_out_gas_used_mean) showed slightly lower median values for Sybil accounts compared to Non-Sybils.

Interpretation: This could tentatively suggest that Sybil transactions, on average, might be computationally simpler (e.g., basic transfers vs. complex DeFi interactions requiring more gas) or that Sybil operators employ strategies (potentially automated) to optimize for lower gas prices or fees more consistently than average users.

3.6. Feature Correlations

Finding: Several groups of engineered features exhibited high positive correlations.

Evidence: The correlation heatmap revealed strong positive correlations (>0.8) between:

Mean and median for value-based features (e.g., tx_out_value_mean / median).

Various count metrics across different activity types (e.g., ds_count / tt_count).

Different uniqueness metrics (e.g., ds_unique_pools / tt_unique_contracts).

Temporal features like account_age_days and activity_duration_days.

Interpretation: High correlations indicate some potential redundancy between features. For instance, mean and median value features capture very similar information. While tree-based models can handle multicollinearity, this information could be used for feature selection if model simplification or further optimization were needed.


3.7. Missing Values

Finding: The presence and pattern of missing values (NaNs) in the aggregated features were strongly correlated with the Sybil label.

Evidence:

Calculation of missing percentages showed high rates for features derived from specific activities (especially DEX swaps, followed by token transfers and transactions), primarily indicating a lack of that activity type for an address.

The missingno matrix plot, sorted by label, visually demonstrated that Non-Sybil accounts (Label 0, bottom part of the plot) had significantly more missing data across nearly all feature categories compared to Sybil accounts (Label 1, top part). Sybils were much more likely to have some activity recorded across transactions, transfers, and swaps.

Interpretation: This crucial insight suggests that a lack of broad on-chain engagement (resulting in NaNs for many aggregated features) is itself a characteristic distinguishing Non-Sybils from the more broadly active Sybils in this dataset. This implies that how NaNs are handled during modeling is important; they contain predictive information.
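
The matrix itself is one call with the missingno library, again assuming the features table above:

import missingno as msno

# Missingness matrix sorted by label: Non-Sybils (bottom rows) show far
# more gaps than Sybils (top rows).
msno.matrix(features.sort_values("LABEL").drop(columns=["LABEL"]))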


4. Feature Engineering

Guided by the EDA, a comprehensive feature engineering process was undertaken to transform the raw, time-series activity data into a static feature vector for each address suitable for machine learning models. Approximately 55 numerical features were constructed.

4.1. Feature Categories

Basic Aggregates: These captured the overall volume and central tendency of core activities. Included metrics like tx_out_count, tx_in_count, tt_count, ds_count, and statistical summaries (sum, mean, median, std) for numerical fields like VALUE (ETH), TX_FEE, GAS_USED, GAS_PRICE, AMOUNT_USD, AMOUNT_IN_USD, AMOUNT_OUT_USD. Aggregations were performed separately based on the address acting as the source (e.g., tx_out_* from FROM_ADDRESS) or destination (e.g., tx_in_* from TO_ADDRESS) where applicable.

Uniqueness Counts: Quantified the breadth of an address’s interactions using nunique() aggregations. Examples include tx_out_unique_to_addr_count, tx_in_unique_from_addr_count, tt_unique_contracts, tt_unique_symbols, ds_unique_platforms, ds_unique_pools, ds_unique_tokens_in, ds_unique_tokens_out. These aimed to measure network connectivity and diversity of protocol/token usage.

Temporal Features: Captured the lifecycle and timing patterns of account activity. This involved calculating the overall account_age_days (time since first observed activity), days_since_last_activity (time since last observed activity), and activity_duration_days (time between first and last activity) by combining min/max timestamps across all activity types. Similar duration and recency metrics were also calculated specifically for token transfers (days_since_last_tt, tt_activity_duration_days) and DEX swaps (days_since_last_ds, ds_activity_duration_days).

Chain Preference: Based on the chain column added to the raw transaction data, features like base_tx_ratio and ethereum_tx_ratio were calculated as the proportion of an address’s outgoing transactions occurring on each respective chain.

Activity Ratios: Ratios were engineered to capture relative behaviors and potentially normalize for overall activity level. Examples include tx_val_out_in_ratio, tx_count_out_in_ratio, tx_unique_out_addr_ratio, tx_unique_in_addr_ratio, tx_failed_ratio, ds_tt_count_ratio (swap vs transfer frequency), ds_tt_usd_sum_ratio (swap vs transfer value), and activity_ratio (active duration relative to total age).

Specific Counts: Counts of transfers involving key ecosystem tokens (tt_weth_count, tt_usdc_count, tt_usdt_count) were included as potentially indicative features.
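
A condensed sketch of how the first few categories can be built with pandas groupby aggregations (names reuse the merged transactions frame from Section 2.2; the full pipeline repeats this across all activity tables):

tx_out = transactions.groupby("FROM_ADDRESS").agg(
    tx_out_count=("TX_HASH", "count"),
    tx_out_value_sum=("VALUE", "sum"),
    tx_out_value_mean=("VALUE", "mean"),
    tx_out_value_median=("VALUE", "median"),
    tx_out_fee_mean=("TX_FEE", "mean"),
    tx_out_gas_used_mean=("GAS_USED", "mean"),
    tx_out_unique_to_addr_count=("TO_ADDRESS", "nunique"),
)

# Chain preference: share of outgoing transactions on Base.
tx_out["base_tx_ratio"] = (
    transactions["chain"].eq("base").groupby(transactions["FROM_ADDRESS"]).mean()
)

# Temporal features from min/max timestamps (shown for transactions only;
# the real features combine timestamps across all activity types).
ts = transactions.groupby("FROM_ADDRESS")["BLOCK_TIMESTAMP"].agg(["min", "max"])
snapshot = transactions["BLOCK_TIMESTAMP"].max()
tx_out["account_age_days"] = (snapshot - ts["min"]).dt.days
tx_out["days_since_last_activity"] = (snapshot - ts["max"]).dt.days
tx_out["activity_duration_days"] = (ts["max"] - ts["min"]).dt.days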

4.2. Final Preparation

After merging all engineered features, intermediate timestamp columns used only for temporal calculations were dropped. A final preparation step addressed potential numerical issues before modeling:

Inf Handling: Replaced any infinity values (potentially resulting from ratio calculations with near-zero denominators) with NaN.

Clipping: Clipped extremely large positive or negative finite values to roughly one tenth of the float32 limits, leaving a safety margin to prevent overflow issues in XGBoost.

NaN Imputation: Filled all remaining NaN values (primarily resulting from addresses lacking specific types of activity, or missing USD price data) with the distinct numerical value -1. This strategy allows tree-based models to potentially learn from the pattern of missingness itself, treating it differently from a genuine zero value.
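
These three steps reduce to a few pandas operations (a sketch, using the features table assumed earlier):

import numpy as np

# inf -> NaN, clip to a safe float32 range, then fill NaNs with the
# sentinel -1 so trees can learn from missingness directly.
SAFE = np.finfo(np.float32).max / 10

features = features.replace([np.inf, -np.inf], np.nan)
num_cols = features.select_dtypes(include=[np.number]).columns
features[num_cols] = features[num_cols].clip(-SAFE, SAFE).fillna(-1)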

5. Modeling Approach

A robust modeling strategy was employed, focusing on powerful gradient boosting algorithms and best practices for validation and ensembling.

Validation Strategy: Stratified 5-Fold Cross-Validation served as the cornerstone for model evaluation and generating out-of-fold (OOF) predictions. Stratification by the LABEL column ensured that the severe class imbalance (~2.5% Sybil) was preserved in each train/validation split, leading to more reliable AUC estimates and preventing folds from having zero or very few Sybil examples. Area Under the ROC Curve (AUC) was used as the primary optimization and evaluation metric, as it effectively measures a model’s ability to rank positive instances higher than negative ones, which is suitable for imbalanced classification.

Handling Imbalance: The scale_pos_weight parameter, available in LightGBM, XGBoost, and CatBoost, was used to counteract the class imbalance. It was set to the ratio of negative class count to positive class count (~38.2), effectively increasing the weight (importance) of correctly classifying the minority Sybil class during model training.
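
A minimal sketch of this validation loop, shown with LightGBM; the names X, y, and X_test are assumed (a pandas feature matrix, label Series, and test matrix), not taken from the competition code:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

neg, pos = (y == 0).sum(), (y == 1).sum()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y))
test_pred = np.zeros(len(X_test))

for train_idx, val_idx in skf.split(X, y):
    model = lgb.LGBMClassifier(scale_pos_weight=neg / pos)  # ~38.2 here
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # Out-of-fold predictions for unbiased AUC; test predictions averaged.
    oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / skf.n_splits

print("OOF AUC:", roc_auc_score(y, oof))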

Models: Three state-of-the-art Gradient Boosting Machine (GBM) implementations were selected for their high performance on tabular data:

LightGBM: Chosen for its speed, efficiency, and excellent predictive power. Hyperparameters were rigorously tuned using the Optuna library, performing a search over 42 trials where each trial involved a full 5-fold CV evaluation optimizing for mean AUC (a sketch of this search follows the model list below).

XGBoost: A widely adopted and powerful GBM library. It was trained using a competitive default parameter set, providing model diversity.

CatBoost: Known for its robustness and unique handling of categorical features (though less critical here as features were numeric). Trained with competitive default parameters to add further diversity to the ensemble.

GPU Acceleration: Training for all three models was performed utilizing GPU acceleration (device='gpu' in LightGBM, device='cuda' in XGBoost, task_type='GPU' in CatBoost) to significantly reduce computation time on the large feature set.
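
The LightGBM tuning described above can be expressed as an Optuna study; this reuses skf, neg, and pos from the CV sketch earlier, and the search space shown is illustrative rather than the exact space used:

import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "scale_pos_weight": neg / pos,
    }
    # Each trial is scored by a full stratified 5-fold CV on mean AUC.
    model = lgb.LGBMClassifier(**params)
    return cross_val_score(model, X, y, cv=skf, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=42)
print(study.best_value, study.best_params)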

Ensembling: A Weighted Average Ensemble was constructed as the final predictive model. The predictions from the tuned LightGBM, XGBoost, and CatBoost models (generated via their respective 5-fold CV processes on the test set) were averaged together. The weights assigned to each model were proportional to their individual OOF AUC scores relative to a baseline AUC of 0.5 (weight_i = (AUC_i - 0.5) / sum(AUC_j - 0.5)). This approach gives slightly more influence to models that demonstrated better OOF performance.
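
In code, the weighting reduces to a couple of lines (auc_* and pred_* are assumed names for the three models' OOF AUCs and test-set predictions):

import numpy as np

aucs = np.array([auc_lgbm, auc_xgb, auc_cat])  # individual OOF AUCs
weights = (aucs - 0.5) / (aucs - 0.5).sum()    # weight_i = (AUC_i - 0.5) / sum_j (AUC_j - 0.5)
final_pred = weights[0] * pred_lgbm + weights[1] * pred_xgb + weights[2] * pred_cat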

6. Results

The comprehensive modeling pipeline yielded exceptionally high performance, validating the effectiveness of the engineered features and chosen algorithms.

Individual Model Performance (OOF AUC): Each of the three GBMs achieved outstanding OOF AUC scores on the full feature set, demonstrating strong individual predictive capabilities:

Tuned LightGBM OOF AUC: 0.996945

XGBoost OOF AUC: 0.996921

CatBoost OOF AUC: 0.996928

The remarkable consistency across these diverse GBM implementations underscores the strong signal captured by the engineered features.

Final Ensemble Performance (OOF): The weighted average ensemble (which resulted in near-equal weights due to the very similar individual AUCs) produced the following robust OOF performance:

Weighted Ensemble OOF AUC: 0.997309

This score, marginally higher than the best single-model OOF AUC in this run, represents a highly reliable estimate of generalization performance and benefits from the combined strengths of the three models.

Classification Metrics (at 0.5 probability threshold):

Accuracy: 0.9908 (~99.1%) - High, but influenced by imbalance.

Sybil Recall: 0.97 - The ensemble correctly identified 97% of the true Sybil accounts (missing only 72 out of 2528). This high recall is critical for effective Sybil detection.

Sybil Precision: 0.75 - When the ensemble predicted an account was Sybil, it was correct 75% of the time. The remaining 25% (838 addresses) were False Positives.

Sybil F1-Score: 0.84 - A strong harmonic mean of precision and recall.

Confusion Matrix: The matrix quantified the trade-off: very few missed Sybils (False Negatives = 72) at the expense of a moderate number of misclassified Non-Sybils (False Positives = 838).

Prediction Distribution: Visual analysis confirmed excellent separation between the predicted probabilities for the two classes, with most predictions concentrated very close to 0 or 1.

Key Feature Importances: Analysis of feature importances (primarily from tuned LGBM, averaged across folds) revealed the most influential factors driving the model’s predictions:

Top Tier: Temporal features (account_age_days, days_since_last_tt, days_since_last_activity, tt_activity_duration_days), Transaction Cost/Efficiency (tx_out_gas_used_mean, tx_out_gas_price_mean, tx_out_fee_sum), and key Ratio features (tx_unique_out_addr_ratio, tx_count_out_in_ratio).

Highly Important: Other significant features included value summaries (tx_out_value_median), other cost metrics (tx_out_fee_mean), value ratios (tx_val_out_in_ratio), overall activity duration (activity_duration_days), and chain preference ratios (ethereum_tx_ratio, base_tx_ratio).

This ranking strongly aligns with the EDA findings, confirming that account lifecycle, recency, activity breadth, transaction efficiency, and chain choice were the most powerful predictors in this dataset.
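
A sketch of how such a ranking can be produced, assuming the per-fold LightGBM models were collected in a list fold_models during cross-validation (a hypothetical name, not from the competition code):

import pandas as pd

# Average gain/split importances across the five folds and rank them.
imp = pd.DataFrame(
    {f"fold_{i}": m.feature_importances_ for i, m in enumerate(fold_models)},
    index=X.columns,
)
print(imp.mean(axis=1).sort_values(ascending=False).head(20))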

7. Discussion & Conclusion

This project successfully engineered a highly effective machine learning solution for Sybil detection tailored to the specific dataset and objectives of the CryptoPond competition. The detailed Exploratory Data Analysis was instrumental in identifying key behavioral differentiators, notably the higher activity, broader network interaction, distinct temporal profiles (older, more consistently active Sybils), and strong Base chain preference exhibited by labeled Sybil accounts.

A comprehensive feature set was constructed to quantify these observations. The application of tuned and diverse Gradient Boosting Models (LightGBM, XGBoost, CatBoost) within a robust Stratified K-Fold cross-validation framework yielded outstanding individual model performance, with OOF AUC scores approaching 0.997.

The final 3-model weighted ensemble produced a state-of-the-art OOF AUC of 0.9973. Critically, at a standard 0.5 decision threshold, the ensemble achieved an excellent Sybil recall of 97%, demonstrating its capability to identify the vast majority of malicious actors defined within this dataset. The corresponding Sybil precision of 75% represents a reasonable trade-off, although the optimal balance might be adjusted depending on the specific costs associated with False Positives versus False Negatives in a real-world deployment.

While the model exhibits strong performance on the provided data, certain limitations should be acknowledged. The model’s effectiveness is inherently tied to the quality and representativeness of the initial Sybil labels; different labeling methodologies could yield different results. Furthermore, sophisticated adversaries continuously adapt their strategies (concept drift), potentially requiring model retraining or feature updates over time.

Future work could explore avenues for marginal improvement, such as incorporating external data (e.g., known malicious contract lists, CEX deposit address heuristics), developing complex graph-based features using Graph Neural Networks to explicitly model wallet interactions, or implementing more advanced ensembling techniques like stacking. However, given the near-perfect OOF AUC achieved, the current feature set and ensemble likely capture the bulk of the predictive signal present in this specific dataset.

In conclusion, the developed 3-model weighted ensemble provides a powerful, data-driven solution for this Sybil detection task. The rigorous methodology, combining deep EDA, comprehensive feature engineering, model optimization, and robust ensembling, resulted in a model demonstrating exceptional performance in identifying Sybil behavior based on on-chain activity. The final submission (submission_ensemble_weighted_3model.csv) encapsulates this optimized solution.

Sybil Detection Model Writeup by achankun

Participant: achankun
Email: ichsanbit45@gmail.com

Overview

This writeup describes my approach to building a machine learning model for detecting Sybil wallets in the Ethereum ecosystem. The solution combines feature engineering from blockchain transaction data with a LightGBM classifier to predict the probability of an address being a Sybil wallet.

Data Preparation

The dataset included labeled wallet addresses from both Base and Ethereum chains, along with their transaction histories:

  • Address Data:
    • Training set: 104,016 addresses (combined Base + Ethereum)
    • Test set: 19,584 addresses
  • Transaction Data:
    • Regular transactions (1.4M records)
    • Token transfers (4.5M records)
    • DEX swaps (685k records)

Key preprocessing steps:

  1. Combined Base and Ethereum chain data
  2. Standardized column names across datasets
  3. Converted numeric columns to appropriate types
  4. Validated the target variable (is_sybil) to ensure binary labels (0/1)

Feature Engineering

I created three categories of features for each wallet address:

1. Transaction Features

  • Count, sum, mean, std, and median of transaction values
  • Gas price statistics (mean, std, median)
  • Gas used statistics (mean, std, median)
  • Number of unique blocks interacted with

2. Token Transfer Features

  • Count, sum, and distribution statistics of token amounts
  • USD value statistics (when available)
  • Number of unique recipient addresses

3. DEX Swap Features

  • Count and amount statistics for swap inputs/outputs
  • USD value statistics for swaps
  • Number of unique tokens swapped

All features were merged by wallet address, with missing values filled as 0 (assuming no activity in that category).
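
A sketch of that merge, with illustrative frame names (addresses holds the wallet list; the three feature frames correspond to the categories above):

features = (
    addresses
    .merge(tx_features, on="address", how="left")
    .merge(token_features, on="address", how="left")
    .merge(dex_features, on="address", how="left")
    .fillna(0)  # missing value = no activity of that type, per the assumption above
)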

Model Selection

I chose LightGBM for several reasons:

  • Handles tabular data effectively
  • Robust to feature scales and types
  • Efficient with large datasets
  • Built-in handling of class imbalance

Model Configuration:

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,        # upper bound on boosting rounds (early stopping trims this)
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,            # row subsampling per tree
    colsample_bytree=0.8,     # feature subsampling per tree
    random_state=42,
    class_weight='balanced',  # built-in handling of the class imbalance
    metric='auc'
)

Training Approach:

  • 80/20 stratified split for validation
  • Early stopping after 50 rounds without improvement
  • AUC as the evaluation metric
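
Putting the configuration and training approach together (X and y are assumed names for the merged feature matrix and is_sybil labels, and model is the classifier configured above):

from sklearn.model_selection import train_test_split
import lightgbm as lgb

# 80/20 stratified split, then fit with early stopping on validation AUC.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop after 50 stale rounds
)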

Results

The model achieved a validation AUC of 0.923, demonstrating strong ability to distinguish between Sybil and legitimate wallets.

Top 10 Features by Importance:

  1. tx_value_count (number of transactions)
  2. token_amount_precise_count (number of token transfers)
  3. tx_gas_used_mean (average gas used)
  4. dex_amount_in_count (number of DEX swaps)
  5. token_to_address_nunique (unique recipients)
  6. tx_value_sum (total ETH transacted)
  7. tx_block_number_nunique (unique blocks)
  8. token_amount_usd_sum (total USD value)
  9. tx_gas_price_mean (average gas price)
  10. dex_amount_out_sum (total tokens received)

Key Insights

  1. Activity Patterns Matter: The count of transactions and token transfers were most predictive
  2. Economic Signals: Total value transacted and gas usage provided strong signals
  3. DEX Behavior: Swap activity was particularly indicative of Sybil behavior
  4. Network Diversity: Wallets interacting with more unique addresses/blocks were less likely to be Sybils

Conclusion

This solution demonstrates that Sybil detection can be effectively automated using transaction pattern analysis. The LightGBM model successfully learned meaningful patterns from wallet activity data while handling the inherent class imbalance.

Future improvements could include:

  • Graph-based features capturing wallet connections
  • Time-series analysis of transaction patterns
  • Ensemble approaches combining multiple models