Sybil Detection Model Writeup: Crypto Pond Competition

Participant: Ujjwal Kumar

Email: gautamujjwall513@gmail.com

Model Approach for Predicting Sybil Scores of Wallets
Objective:
The goal of this project was to predict Sybil scores for wallets. Sybil scores are used to identify fraudulent ("Sybil") accounts that do not behave like legitimate users in a system; a higher Sybil score indicates that a wallet is more likely to be fraudulent or manipulated.

Data Processing:
The dataset consists of wallet attributes such as transaction history, wallet balances, and activity patterns. The following data preprocessing steps were performed:

Missing Values: Any missing values in the dataset were handled through imputation using the mean or median, depending on the feature type.

Feature Encoding: Categorical variables, such as wallet types or transaction categories, were encoded using one-hot encoding or label encoding.

Feature Scaling: Continuous variables were normalized to ensure all features were on the same scale, using min-max normalization or standardization.

Data Splitting: The data was split into training and validation sets using stratified K-fold cross-validation to ensure that each fold contains a similar distribution of Sybil and non-Sybil wallets.
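The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the actual competition pipeline: the column names (tx_count, balance, wallet_type, is_sybil) and the toy data are assumptions made for the example.

```python
# Sketch of the preprocessing steps, assuming hypothetical column names
# (tx_count, balance, wallet_type) and a binary is_sybil label.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({
    "tx_count": [12, 3, np.nan, 40, 7, 1, 55, 9],
    "balance": [1.5, 0.2, 0.9, np.nan, 4.1, 0.1, 7.3, 0.8],
    "wallet_type": ["eoa", "contract", "eoa", "eoa",
                    "contract", "eoa", "eoa", "contract"],
    "is_sybil": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Missing values: median imputation for the continuous features
num_cols = ["tx_count", "balance"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Feature encoding: one-hot encode the categorical wallet type
df = pd.get_dummies(df, columns=["wallet_type"])

# Feature scaling: min-max normalization to [0, 1]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Data splitting: stratified folds preserve the Sybil / non-Sybil ratio
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
X, y = df.drop(columns="is_sybil"), df["is_sybil"]
for train_idx, val_idx in skf.split(X, y):
    # each fold keeps roughly the same class balance as the full data
    print(y.iloc[val_idx].mean())
```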

Feature Engineering:
Several new features were engineered from the raw data to capture wallet behavior patterns:

Transaction Frequency: The number of transactions per day/month/year to capture how active the wallet is.

Average Transaction Size: This feature captures the typical size of transactions made by a wallet.

Balance Patterns: Whether the wallet shows unusual balance fluctuations, which may indicate manipulation.

Activity Patterns: The times during which a wallet is most active (e.g., if a wallet shows activity at unusual times, it might indicate suspicious behavior).
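The engineered features above can be illustrated with a small pandas sketch. The per-transaction log and its columns (wallet, timestamp, amount) are hypothetical stand-ins for the real data, and the balance-volatility proxy here is a simplification of the actual balance-pattern features.

```python
# Illustrative sketch of the engineered features, using a hypothetical
# per-transaction log with wallet, timestamp, and amount columns.
import pandas as pd

tx = pd.DataFrame({
    "wallet": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 03:00", "2024-01-02 03:10", "2024-01-09 03:05",
        "2024-01-01 14:00", "2024-02-15 16:30",
    ]),
    "amount": [10.0, 12.0, 11.0, 500.0, 2.0],
})

# Days between first and last transaction (at least 1 to avoid div-by-zero)
span_days = tx.groupby("wallet")["timestamp"].agg(
    lambda t: max((t.max() - t.min()).days, 1))

features = pd.DataFrame({
    # Transaction frequency: transactions per active day
    "tx_per_day": tx.groupby("wallet").size() / span_days,
    # Average transaction size
    "avg_tx_size": tx.groupby("wallet")["amount"].mean(),
    # Balance patterns: amount volatility as a rough proxy
    "amount_std": tx.groupby("wallet")["amount"].std(),
    # Activity patterns: most common hour of day the wallet transacts
    "modal_hour": tx.groupby("wallet")["timestamp"]
        .agg(lambda t: t.dt.hour.mode().iloc[0]),
})
print(features)
```

Wallet "a" here transacts only around 03:00, the kind of regular, odd-hours pattern the activity features are meant to surface.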

Modeling:
For the predictive task, we selected LightGBM (Light Gradient Boosting Machine), a gradient boosting model. LightGBM is known for its efficiency in handling large datasets and its ability to provide accurate results with minimal tuning. Here’s a brief overview of the steps followed in the modeling process:

Model Choice: LightGBM was chosen for its speed and strong performance on structured, tabular data; it also handles missing values natively and scales well to large datasets.

Hyperparameter Tuning: The hyperparameters were tuned to ensure optimal performance. Key parameters tuned included:

Learning rate: 0.01

Maximum depth: 7

Number of leaves: 31

Regularization parameters (L1 and L2) to prevent overfitting.

Cross-validation: To assess the performance of the model, 5-fold cross-validation was used, with AUC (Area Under the Curve) and binary log-loss as the evaluation metrics.

Performance:
The model performed well, achieving a mean AUC of 0.9364 across the five validation folds, with consistent performance from fold to fold. The binary log-loss of 0.0614 indicates that the model's predictions were well calibrated.

Fold 5 AUC: 0.9363 (indicating consistency in performance across folds)

Mean AUC: 0.9364

The results suggest that the model is able to effectively distinguish between legitimate and fraudulent wallets, as indicated by the high AUC score.

Challenges Faced:
Some challenges encountered during model development included:

Imbalanced Classes: The dataset had an imbalance between legitimate and Sybil wallets. To mitigate this, class weights were adjusted during model training, and sampling techniques (e.g., oversampling) were considered to balance the dataset.
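One common way to implement the class-weight adjustment mentioned above is to weight the minority (Sybil) class by the negative-to-positive ratio; the 9:1 split below is an illustrative assumption, not the dataset's actual imbalance.

```python
# Class-weight sketch: up-weight the minority (Sybil) class by the
# negative/positive ratio. The 9:1 imbalance here is illustrative.
import numpy as np

y = np.array([0] * 900 + [1] * 100)

# scale_pos_weight-style ratio used by gradient boosting libraries
# (e.g., it can be passed as scale_pos_weight to LGBMClassifier)
pos_weight = (y == 0).sum() / (y == 1).sum()
print(pos_weight)  # 9.0

# Equivalent per-sample weights, usable as sample_weight in fit()
sample_weight = np.where(y == 1, pos_weight, 1.0)
```

An alternative is to let the library compute this automatically, e.g. `class_weight="balanced"` in LightGBM's scikit-learn API.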

Feature Selection: Identifying which features were most influential in predicting Sybil scores was an iterative process. Extensive feature engineering was necessary to capture meaningful patterns in wallet behavior.

Next Steps:
To further improve the model, the following steps could be explored:

Model Comparison: While LightGBM performed well, other models such as XGBoost, CatBoost, or deep learning models could be tested for comparison.

Hyperparameter Optimization: More exhaustive hyperparameter tuning, possibly using Bayesian optimization or randomized search, could fine-tune the model further.

Ensemble Methods: Combining multiple models through an ensemble approach (e.g., stacking or bagging) may improve prediction stability and generalization.
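A minimal stacking sketch, using scikit-learn estimators as stand-in base learners (the actual ensemble would combine the tuned LightGBM model with others), could look like:

```python
# Stacking sketch: two base learners blended by a logistic regression
# meta-learner trained on out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # out-of-fold predictions feed the meta-learner
)
auc = cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean()
print(f"Stacked AUC: {auc:.3f}")
```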

Conclusion:
The model developed demonstrates strong performance in predicting Sybil scores for wallets, with an AUC of 0.9364 and a binary log-loss of 0.0614. The use of LightGBM was effective, and future improvements could involve exploring different model architectures, fine-tuning hyperparameters, or using ensemble methods for even better performance. This model has the potential to be deployed in systems where detecting fraudulent or Sybil wallets is critical.