
Web3 Wallet Analytics & Intelligence

This program trains participants in blockchain data analysis, address clustering, fund flow tracing, and on-chain anomaly detection. The 180-day curriculum is segmented into 15-day blocks with focused topics and deliverables, culminating in a robust skillset for crypto intelligence (inspired by platforms like Nansen and Arkham).

Block 1 (Days 1–15): Blockchain Data Foundations

  • Topics Covered: Introduction to blockchain fundamentals with an analytics perspective. Review of how transactions and blocks work in major chains (Bitcoin vs Ethereum transaction models), the structure of addresses and cryptographic keys. Understanding on-chain data availability: what data is publicly accessible (addresses, transactions, smart contract logs) and the concept of pseudonymity. Tools for exploring raw blockchain data (block explorers like Etherscan, blockchain node JSON-RPC APIs). Overview of The Graph – an indexing protocol that organizes blockchain data and makes it queryable via GraphQL.
  • Deliverables: Set up a development environment with access to blockchain data, e.g., a free Ethereum node API (Infura/Alchemy) or a local node. A short cheat-sheet documenting the structure of a blockchain’s data (for Ethereum: blocks, transactions, internal transactions, event logs). Simple Python scripts to fetch data, e.g., given an address, retrieve its transaction history via an API call (see the sketch below). Also, a written summary of differences in UTXO vs account-model data handling.
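
A minimal sketch of the transaction-history script, assuming Etherscan's account `txlist` endpoint and a placeholder API key; any comparable provider endpoint can be substituted:

```python
import requests

ETHERSCAN_API = "https://api.etherscan.io/api"
API_KEY = "YOUR_API_KEY"  # placeholder: register for a free key on Etherscan

def fetch_tx_history(address, api_key=API_KEY):
    """Fetch normal (external) transactions for an address via Etherscan."""
    params = {
        "module": "account",
        "action": "txlist",
        "address": address,
        "startblock": 0,
        "endblock": 99999999,
        "sort": "asc",
        "apikey": api_key,
    }
    resp = requests.get(ETHERSCAN_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("result", [])

if __name__ == "__main__":
    txs = fetch_tx_history("0x...")  # replace with a real address
    for tx in txs[:5]:
        # value is returned in wei; divide by 1e18 for ETH
        print(tx["hash"], tx["from"], "->", tx["to"], int(tx["value"]) / 1e18, "ETH")
```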

Block 2 (Days 16–30): On-Chain Data Extraction & Indexing

  • Topics Covered: Techniques for extracting large-scale on-chain data for analysis. Using The Graph to index blockchain events: writing a simple subgraph for Ethereum (e.g., indexing transfers of a specific ERC-20 token). Understanding the subgraph manifest, entity schema, and mappings (written in AssemblyScript). Introduction to Substreams for advanced indexing: how parallel processing of blockchain data can speed up queries. Comparison of using The Graph vs direct blockchain queries vs cloud datasets (Google BigQuery public crypto datasets). Data sources for historical blockchain data and how to gather data across multiple chains.
  • Deliverables: Implement and deploy a basic subgraph on The Graph (or run it locally) for a chosen contract (for instance, track all transfers of a token or all trades on a DEX). Provide GraphQL queries and outputs demonstrating data retrieval (e.g., “list of all accounts that received the token”; see the query sketch below). Additionally, outline a plan for a Substreams module: describe in pseudo-code how you would extract, say, all ERC-20 transfers using a Rust-based Substreams approach (actual coding optional). Provide a short reflection on the benefits of parallel indexing with Substreams.
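
One way to demonstrate the retrieval is to query the subgraph's GraphQL endpoint from Python. The endpoint URL and the `transfers` entity with `from`/`to`/`value` fields are assumptions here; they must match your deployed schema:

```python
import requests

# Placeholder endpoint: replace with your deployed subgraph's query URL.
SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/<org>/<subgraph>"

QUERY = """
{
  transfers(first: 10, orderBy: value, orderDirection: desc) {
    id
    from
    to
    value
  }
}
"""

resp = requests.post(SUBGRAPH_URL, json={"query": QUERY}, timeout=30)
resp.raise_for_status()
transfers = resp.json()["data"]["transfers"]

# e.g., "list of all accounts that received the token" (within this page)
receivers = {t["to"] for t in transfers}
print("accounts that received the token:", receivers)
```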

Block 3 (Days 31–45): Graph Analysis & Address Clustering Techniques

  • Topics Covered: Graph theory applied to blockchain: addresses as nodes, transactions as edges. Techniques for address clustering – identifying which addresses likely belong to the same entity. Bitcoin clustering heuristics (e.g., the common-input heuristic for identifying wallets). Ethereum clustering considerations: linking addresses via usage patterns (contracts that funnel funds to the same address, timing analysis, etc.). Overview of heuristics used by analytics firms: identifying exchange wallets (through known patterns or tagging), DeFi addresses, etc. Introduction to Arkham Intelligence and Nansen approaches: labeling and clustering millions of addresses (Nansen boasts 300M+ labeled addresses). Discussion of privacy vs transparency: how Arkham links clusters of addresses to real-world identities, and the ethical implications.
  • Deliverables: Use a graph analysis library (NetworkX in Python, or Neo4j) to create a small transaction graph. For example, take an Ethereum address and its counterparties up to 2 hops out, and visualize or analyze connectivity. Implement a simple clustering heuristic: for Bitcoin data (if accessible), cluster addresses that appear together as inputs (see the sketch below); for Ethereum, perhaps cluster addresses by co-interaction with a specific DeFi contract or by time correlation. Deliver a Jupyter Notebook with the analysis steps and a visualization (graph diagram or adjacency matrix). Also include a brief write-up of how companies like Arkham cluster addresses and the potential inaccuracies or controversies.
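
A toy sketch of the common-input heuristic using NetworkX connected components; the per-transaction input lists are hypothetical and would be parsed from raw Bitcoin transaction data in practice:

```python
import networkx as nx

# Hypothetical sample: each transaction listed with its input addresses.
tx_inputs = [
    ["addr1", "addr2"],   # one tx spends from addr1 and addr2 together
    ["addr2", "addr3"],
    ["addr4"],
    ["addr5", "addr6"],
]

G = nx.Graph()
for inputs in tx_inputs:
    G.add_nodes_from(inputs)
    # Common-input heuristic: addresses co-spent in one tx likely share
    # an owner, so link every input address to the first one.
    first = inputs[0]
    for addr in inputs[1:]:
        G.add_edge(first, addr)

# Each connected component is one inferred wallet/entity.
for i, cluster in enumerate(nx.connected_components(G)):
    print(f"cluster {i}: {sorted(cluster)}")
# Expected: {addr1, addr2, addr3}, {addr4}, {addr5, addr6}
```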

Block 4 (Days 46–60): Fund Flow Tracing

  • Topics Covered: Techniques for tracing fund flows through blockchain networks. How to follow the money through multiple hops: transaction graph traversal, identifying mixers or tumblers (e.g., Tornado Cash on Ethereum) and how they obfuscate flows. Case studies of fund tracing: tracing stolen crypto from a hack, following an Ethereum phishing scam’s proceeds across addresses. Tools for automated flow tracing and visualization (e.g., GraphSense, an open-source toolkit for crypto fund flow analysis). Concepts of taint analysis in Bitcoin and token flow analysis in Ethereum. Using The Graph or custom scripts to query multi-hop flows.
  • Deliverables: A step-by-step analysis of a sample fund flow. For instance, trace a transaction from a known large address (could use a public example like the flow of ETH from the DAO hack or a simpler made-up scenario) through intermediate addresses. Represent this flow as a chain of transactions in a diagram or graph. Provide a small script that, given a starting address and a target address, attempts to find a path of transactions connecting them using breadth-first search on the transaction graph (see the sketch below). Deliver the script and a report of a successful trace (or an explanation of obstacles if the trace goes through a mixer or is too complex).
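
A minimal breadth-first-search sketch over a hypothetical edge list of (sender, receiver) pairs; a real trace would build the adjacency map from fetched transactions:

```python
from collections import deque

# Hypothetical directed edges extracted from transactions.
edges = [
    ("hacker", "hop1"), ("hop1", "hop2"),
    ("hop2", "exchange_deposit"), ("hop1", "dead_end"),
]

adj = {}
for src, dst in edges:
    adj.setdefault(src, []).append(dst)

def find_path(start, target):
    """BFS over the transaction graph; returns the address path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("hacker", "exchange_deposit"))
# ['hacker', 'hop1', 'hop2', 'exchange_deposit']
```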

Block 5 (Days 61–75): Advanced Clustering & Entity Attribution

  • Topics Covered: Building upon basic clustering to more advanced entity attribution. Using off-chain data and heuristics to label clusters: e.g., an address that interacted with known exchange deposit addresses can be labeled as belonging to a certain exchange (use of open tags from Etherscan or wallet databases). Machine learning approaches to clustering: features like transaction frequency, active hours (time-of-day usage may cluster addresses controlled by the same person in the same timezone), gas price patterns (for Ethereum), and so on. Introduction to address profiling: identifying what type of user an address is (e.g., “NFT collector”, “arbitrage bot”, “whale investor”) based on behavior. Review of Nansen’s labeling categories (exchange, fund, smart money, etc.). Data enrichment via services like ENS (if an address has a human-readable name) or other identity links.
  • Deliverables: Construct a feature set for addresses and perform a basic clustering or classification. For example, take a set of Ethereum addresses and compute features: number of transactions, distinct counterparties, DeFi protocols used (if any), etc. Use unsupervised clustering (e.g., K-means; see the sketch below) to see if natural groupings emerge (perhaps distinguishing “exchange hot wallets” vs “individual user wallets” in a small sample). Alternatively, if labels are available (from Etherscan’s public tags or datasets), attempt a supervised classification of addresses. Deliver a report with findings—did the clustering find meaningful groupings? Include any charts or cluster visualizations.
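
A compact sketch of the clustering step with scikit-learn; the feature values below are made up purely to illustrate the workflow:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-address features, normally computed from fetched transactions.
df = pd.DataFrame(
    {
        "tx_count": [12000, 9500, 40, 25, 60, 11000],
        "distinct_counterparties": [8000, 7200, 12, 9, 20, 7900],
        "defi_protocols_used": [0, 1, 4, 3, 5, 0],
    },
    index=["hot_wallet_a", "hot_wallet_b", "user_1",
           "user_2", "user_3", "hot_wallet_c"],
)

# Standardize features so no single scale dominates, then cluster.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(df)
# Exchange-like wallets and individual users should land in separate clusters.
```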

Block 6 (Days 76–90): On-Chain Interaction & Behavior Analysis

  • Topics Covered: Interaction markers on-chain – analyzing how addresses interact with smart contracts and what that reveals. Identifying users by their DeFi activities (e.g., addresses that frequently interact with Uniswap vs those using lending protocols). Studying NFT activity patterns for profiling (e.g., an address buying certain NFT collections could mark it as an NFT trader). Cross-chain activity: using data from multiple chains to see if the same entity is active across them (if the program scope includes multi-chain). Social network analysis on blockchain: treating interactions (like ETH transfers, contract calls) as edges to identify communities or central players. Exploration of specific tools like Dune Analytics (for SQL querying of on-chain data) to get quick insights.
  • Deliverables: A case study analysis of a single address or a small set: for example, pick a known whale address (perhaps identified via Nansen’s “smart money” label or a known ENS name) and analyze its interactions. Produce an “address dossier” that describes: what protocols it uses most, any notable tokens or NFTs it holds, patterns in its behavior (e.g., tends to do trades at certain times or follows certain strategies). Use on-chain queries (via The Graph, Dune, or custom scripts) to gather this info (see the aggregation sketch below). The deliverable is the dossier report, citing how you derived each insight (with query code or data evidence). This mimics how an Arkham or Nansen analyst might profile a wallet.
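
Two of the dossier questions (favorite protocols, active hours) reduce to simple aggregations once the transactions sit in a DataFrame. The tag labels and values below are hypothetical; `to_label` would come from a tag list you maintain:

```python
import pandas as pd

# Hypothetical flat export of one address's transactions.
txs = pd.DataFrame({
    "to_label": ["Uniswap V3 Router", "Uniswap V3 Router", "Aave Pool",
                 "Uniswap V3 Router", "OpenSea", "Aave Pool"],
    "value_eth": [1.2, 0.8, 5.0, 2.1, 0.3, 4.0],
    "timestamp": pd.to_datetime([
        "2024-01-05 14:02", "2024-01-06 14:10", "2024-01-08 03:15",
        "2024-01-09 14:30", "2024-01-12 20:45", "2024-01-15 03:40",
    ]),
})

# Which protocols does the address use most, and with how much value?
print(txs.groupby("to_label")["value_eth"].agg(["count", "sum"]))

# When is it active? A time-of-day histogram hints at the owner's timezone.
print(txs["timestamp"].dt.hour.value_counts().sort_index())
```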

Block 7 (Days 91–105): Anomaly Detection in Blockchain Activity

  • Topics Covered: Identifying anomalies and suspicious patterns on-chain. Types of anomalies: sudden large transfers from a dormant address, flash loan attacks (large in/out in one transaction), abnormal gas usage patterns indicating potential MEV (maximal extractable value) activity. Techniques: statistical anomaly detection (e.g., addresses whose activity spikes abnormally), graph-based anomalies (subgraphs of the transaction network that have unusual structure, possibly indicating fraud rings). Introduction to unsupervised learning methods like isolation forests or autoencoders applied to blockchain time-series data. Discussion of known examples of on-chain anomalies: detecting Ponzi schemes or rug pulls by patterns of fund flows, sybil attack detection in airdrops (multiple addresses created to abuse an airdrop, as in Nansen’s sybil detection work for the Arbitrum airdrop).
  • Deliverables: Develop a simple anomaly detection script for a dataset of transactions. For example, take a list of daily transaction counts or volumes for various addresses and use a statistical method to flag outliers (could be as simple as z-score outlier detection; see the sketch below). Alternatively, implement a rule-based detector for a specific anomaly: e.g., flag if an address drains all funds to a new address and that new address then tumbles funds through a mixer. Test the detector on historical data (if available, e.g., find out whether it would have caught a known exploit pattern). Provide the code and a brief analysis of results (did it catch the expected anomalies? what’s the false positive rate?).
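
A minimal z-score detector over a hypothetical daily-volume series; the spike on day 9 is the planted anomaly:

```python
import numpy as np

# Hypothetical daily outgoing volume (ETH) for one address over two weeks.
daily_volume = np.array([2.0, 1.5, 2.2, 1.8, 0.0, 2.1, 1.9,
                         2.3, 1.7, 250.0, 2.0, 1.6, 2.4, 1.8])

mean, std = daily_volume.mean(), daily_volume.std()
z_scores = (daily_volume - mean) / std

THRESHOLD = 3.0  # flag days more than 3 standard deviations from the mean
for day, (vol, z) in enumerate(zip(daily_volume, z_scores)):
    if abs(z) > THRESHOLD:
        print(f"day {day}: volume {vol} ETH is anomalous (z = {z:.1f})")
```

Note that a single extreme outlier inflates the standard deviation itself, which is one reason robust variants (median/MAD) or isolation forests are worth exploring as the next step.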

Block 8 (Days 106–120): Platform Case Studies: Nansen, Arkham, The Graph

  • Topics Covered: Studying real-world platforms to consolidate understanding. Nansen case study: features of Nansen’s platform – real-time dashboards, labeled address database, tools like “smart money” tracking, and how it uses clustering and heuristics at scale. Arkham Intelligence case study: their approach of deanonymizing addresses through a user-driven labeling marketplace and visualization tools (Arkham’s graph visualizer for address relationships). Discussion of how these platforms handle data ingestion (possibly using The Graph or their own proprietary indexers) and the technologies they likely use (databases, big data processing, etc.). Also look at The Graph’s network itself as a case: how subgraphs are deployed and queried in production by applications like Uniswap’s analytics. This block is about connecting the dots between the training content and industry practice.
  • Deliverables: An analytical essay (~2–3 pages) comparing Nansen and Arkham (and optionally one more, like Chainalysis or a public block explorer) in terms of functionality and techniques. Trainees should cite specific features – e.g., “Nansen labels 300M+ addresses across 20+ chains and provides profiles categorizing wallets (investor, exchange, NFT collector, etc.), whereas Arkham focuses on a marketplace model for label data.” Optionally include screenshots or diagrams to illustrate a feature such as an Arkham network graph or Nansen’s interface. The essay should highlight how the course-taught techniques manifest in these tools.

Block 9 (Days 121–135): Building a Mini Analytics Pipeline

  • Topics Covered: Design and implementation of an end-to-end wallet analytics pipeline on a smaller scale. This involves: data ingestion (from a blockchain API or subgraph), data storage (using a database or in-memory structures), analysis (clustering, tagging, anomaly detection), and visualization. Choice of tech stack for each part (e.g., using Python with SQLite or Pandas for a quick prototype vs using a graph database like Neo4j for relationships). Ensuring the pipeline is modular (so it can be updated for new heuristics or extended to new data). Consideration of scalability – how would one scale this to millions of addresses (discuss big data tools, but implementation remains small-scale).
  • Deliverables: A working Wallet Analytics Pipeline Prototype. For instance: a Python program that takes a list of addresses as input, fetches their recent transactions (using an API or a downloaded dataset), stores them, performs a simple clustering or labeling (e.g., flag if any address is an exchange by checking against a small known list, cluster addresses that transact frequently with each other), and outputs a human-readable report or visualization (a skeleton is sketched below). The output could be, say, a summary of clusters found and any interesting behaviors (like “Address X and Y appear to operate together, and moved large funds on the same days”). Provide the source code (well-documented) and a sample run on a test set of addresses with output.
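
A skeleton of such a pipeline, with the ingestion step stubbed with sample rows and a hypothetical exchange tag list; each stage is a separate function so heuristics can be swapped in later:

```python
"""Minimal wallet-analytics pipeline skeleton: ingest -> label -> report.
The ingest step is stubbed; wire it to your API or subgraph of choice."""
import pandas as pd

KNOWN_EXCHANGES = {"0xexchange_hot_wallet"}  # hypothetical tag list

def ingest(addresses):
    """Fetch recent transactions for the given addresses (stubbed sample data)."""
    rows = [
        {"from": "0xa", "to": "0xexchange_hot_wallet", "value": 10.0},
        {"from": "0xa", "to": "0xb", "value": 3.0},
        {"from": "0xa", "to": "0xb", "value": 1.2},
        {"from": "0xb", "to": "0xa", "value": 2.5},
    ]
    return pd.DataFrame(rows)

def label(df):
    """Flag transfers to known exchange wallets."""
    df["to_is_exchange"] = df["to"].isin(KNOWN_EXCHANGES)
    return df

def report(df):
    """Print a human-readable summary of what the pipeline found."""
    print(f"{len(df)} transactions ingested")
    print(f"{int(df['to_is_exchange'].sum())} transfer(s) to known exchanges")
    pairs = df.groupby(["from", "to"]).size()
    print("frequent counterparties (>1 tx):")
    print(pairs[pairs > 1])

if __name__ == "__main__":
    report(label(ingest(["0xa", "0xb"])))
```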

Block 10 (Days 136–150): Visualization and Reporting

  • Topics Covered: Techniques for effective visualization of on-chain data and communication of insights. Graph visualizations for address networks (using libraries like D3.js, Gephi, or Neo4j Bloom). Time-series charts for balance changes or transaction volumes. Designing dashboards (taking inspiration from Nansen’s UI or Dune Analytics dashboards). Ensuring visualizations can handle the complexity (e.g., simplifying the graph by aggregating known entities). Also, generating concise reports from analysis, suitable for executives or investigators (like an intelligence dossier). Utilizing tools like Jupyter for interactive exploration and output to HTML/PDF reports.
  • Deliverables: A set of visualizations created from the data in the pipeline built in Block 9. Examples: a graph showing clusters of addresses (perhaps using NetworkX and matplotlib to draw, as sketched below, or exporting to Gephi for a more polished graph; include a screenshot if an external tool is used); a timeline of transactions for a chosen address cluster; a pie chart or bar chart of categories of transactions for an address. Package these visuals into a brief “On-Chain Intelligence Report” for a fictitious scenario (e.g., investigating an address suspected of fraud). The report (2–3 pages) should narrate the findings with the visuals embedded.
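
A sketch of the NetworkX-plus-matplotlib option on a hypothetical clustered graph; a fixed layout seed keeps the figure reproducible across runs:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical clustered address graph from the Block 9 pipeline.
G = nx.Graph()
G.add_edges_from([
    ("0xa", "0xb"), ("0xb", "0xc"),            # cluster 1
    ("exchange", "0xa"), ("exchange", "0xd"),  # known entity as hub
    ("0xd", "0xe"),                            # cluster 2
])

# Color the known entity differently so it reads at a glance.
colors = ["tomato" if n == "exchange" else "skyblue" for n in G.nodes]
pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
nx.draw_networkx(G, pos, node_color=colors, node_size=900, font_size=8)
plt.axis("off")
plt.savefig("address_graph.png", dpi=150, bbox_inches="tight")
```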

Block 11 (Days 151–165): Scaling and Automation Considerations

  • Topics Covered: Pushing the prototype towards a production-grade system. Discussion of scaling issues: using databases (SQL vs NoSQL vs graph databases) for millions of transactions, handling real-time data (streams of new blocks), and integrating multiple chains. Introduction to automation: scheduling regular data pulls, updating labels as new information arrives (e.g., newly identified exchange addresses). API design to expose the analytics (if one were to build a service like Nansen Query API). Security and privacy considerations when aggregating data (e.g., ensuring compliance if needed). Also, a brief introduction to smart contract analytics (for completeness): analyzing contract code or events for vulnerabilities or patterns (could be a tangent if time permits).
  • Deliverables: An architecture proposal document that outlines how the current pipeline (from Blocks 9/10) could be extended to handle much larger scale and be used in an organizational setting. This document should include choices of technologies for each component (e.g., “use a graph database like Neo4j or TigerGraph to store the address graph, use Apache Spark or Dask for distributed processing of large datasets, etc.”) and justify them. It should also mention what additional features could be added (e.g., a web front-end, an alerting system for anomalies). Essentially, this is a design blueprint rather than code – the deliverable is a well-thought-out plan (diagrams encouraged) rather than an implemented system at scale.

Block 12 (Days 166–180): Capstone Project – On-Chain Intelligence Report

  • Topics Covered: Synthesis of all skills into a final capstone. Students act as on-chain analysts: they pick a real-world scenario or dataset (or one provided by instructors) and perform a complete analysis using the developed tools and methods. Emphasis on end-to-end execution: data gathering, cleaning, analysis (clustering, flow, anomaly detection), and insight generation. This also involves project management skills – defining the scope (which aspect to focus on: e.g., analyzing the behavior of a known DeFi hacker, or mapping out the ecosystem of a new blockchain protocol), and executing within time. Peer review of projects to simulate presentation to stakeholders.
  • Deliverables: A comprehensive Web3 Intelligence Report & Presentation. For example, a report investigating “The token flow and entity network behind XYZ exploit” or “Profiling the top 10 holders of ABC token and their interactions”. The report should include narrative, data findings, visuals, and references (if using any external data or research). The presentation (slide deck or live demo) should highlight key insights as if to an audience of decision-makers. Along with the report, submit the supporting analysis code (notebook or scripts) used to derive the findings. The outcome demonstrates proficiency in wallet analytics, akin to an internal Arkham/Nansen analyst report with clear, data-backed conclusions.

Full Tech Stack & Tools:

  • Blockchain Data Access: Ethereum (and optionally other chains’) nodes via JSON-RPC, web3.py for scripting queries; The Graph’s CLI and Graph Node for subgraph deployment; Dune Analytics (SQL) for quick queries; Substreams (Rust SDK) for advanced indexing needs.
  • Data Storage & Processing: Python with Pandas for small data; NetworkX for graph analysis in-memory; Neo4j or TigerGraph for persistent graph database use (optional, for advanced graph queries); SQL databases or NoSQL (MongoDB) for storing transaction datasets.
  • Analytics & ML: Scikit-learn for clustering and anomaly detection algorithms; Jupyter notebooks for iterative analysis; libraries like ethplotlib or custom code for blockchain-specific visualizations.
  • APIs & Libraries: web3.py and etherscan-python for retrieving on-chain data; bitcoinlib or similar for Bitcoin data if needed; Nansen’s API (if available for academic use) or other blockchain analytics APIs for enrichment; ENS lookup libraries for resolving Ethereum Name Service data.
  • Visualization: Matplotlib/Seaborn for basic charts; network visualization tools (NetworkX drawing, Gephi for more complex visuals, Plotly for interactive graphs); possibly front-end frameworks (React/D3.js) for a custom dashboard if building one.
  • DevOps & Scaling: Docker containers for running indexing services (e.g., a local Graph Node and its database); schedulers (e.g., cron) for the automated, recurring data pulls discussed in Block 11.