Cross-Platform Social Profiling
This program covers Open-Source Intelligence (OSINT) pipelines to collect and merge user data across social media platforms. Over 180 days, participants will build skills in data extraction, identity resolution, and enrichment with public records, organized in 15-day blocks with cumulative deliverables.
Block 1 (Days 1–15): OSINT Fundamentals & Social Data Overview
- Topics Covered: Introduction to Open-Source Intelligence (OSINT) and its relevance in profiling individuals across public data sources. Legal and ethical considerations of collecting public data (privacy, terms of service of social platforms). Overview of major social platforms (Facebook, Twitter, Instagram, LinkedIn, Reddit, etc.) and the types of publicly available data each provides (profiles, posts, connections). Concept of a unified “digital persona” and the challenge that user information is fragmented across platforms. High-level discussion of methodologies for linking identities (username consistency, email reuse, etc.).
- Deliverables: A short primer document that lists key OSINT resources and tools (mentioning, e.g., search engines, data breach databases, people search sites) and summarizes the data available from at least 3 major social networks. Additionally, each trainee selects a public figure or fictional persona and maps out what sources would likely contain their data (e.g., “John Doe is on Twitter, LinkedIn; possible public records in State X, etc.”) – deliverable in the form of a simple mind-map or list.
Block 2 (Days 16–30): Social Media Data Extraction Techniques
- Topics Covered: Methods to extract data from social platforms: Official APIs vs scraping. Learning to use Twitter API (now X API) for tweets and user info, the Reddit API for posts and comments, etc. Understanding API limitations (rate limits, data fields available, need for API keys). For platforms with restricted APIs (Instagram, Facebook), explore approved methods (Facebook Graph API for developers, LinkedIn API for enterprise) and alternatives like web scraping (requests/BeautifulSoup, Selenium for dynamic content). Introduction to OSINT tools like Scrapy (a Python scraping framework) for automating data collection. Emphasis on data format (JSON from APIs, HTML parsing for scrapers) and error handling (e.g., being blocked by anti-scraping measures).
- Deliverables: Implement at least two data extraction scripts: for example, a Twitter harvester that, given a username, fetches recent tweets and follower count using Tweepy (a third-party Python library for the Twitter API), and a web scraper that, given a LinkedIn public profile URL (or an alternate site if LinkedIn is too restricted), extracts name, current job, and education details. Provide the code and sample output data (JSON or CSV). Additionally, log any obstacles encountered (e.g., needing API credentials or being blocked by anti-bot checks) and how they were addressed.
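The scraper half of this deliverable can be prototyped with the standard library alone before reaching for BeautifulSoup. The sketch below parses a profile-like HTML snippet; the class names (`name`, `headline`) and the markup are invented for illustration, and a real scraper would match the target site's actual structure.

```python
from html.parser import HTMLParser

# Toy profile-page extractor using only the standard library.
# The class names ("name", "headline") are hypothetical; a real
# scraper would match the target site's actual markup.
class ProfileParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "name" in classes:
            self._current = "name"
        elif "headline" in classes:
            self._current = "headline"

    def handle_data(self, data):
        # Capture the text content of the tag we just entered.
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = '<div><h1 class="name">Jane Doe</h1><p class="headline">Data Analyst at Acme</p></div>'
parser = ProfileParser()
parser.feed(html)
print(parser.fields)  # {'name': 'Jane Doe', 'headline': 'Data Analyst at Acme'}
```

The same pattern (fetch page, locate labeled elements, emit a dict) carries over directly to BeautifulSoup selectors once the target markup is known.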
Block 3 (Days 31–45): Data Normalization & Storage
- Topics Covered: After extraction, data comes in varied formats – this block covers normalizing and storing heterogeneous social data. Designing a unified schema for person profiles: e.g., fields like name, username, platform, bio, followers_count, etc., accommodating platform-specific fields. Techniques to clean and standardize data (handling different date formats, location names, etc.). Storage solutions: using a relational database (e.g., PostgreSQL) to store profiles and relationships, or a document database (MongoDB) for flexible schemas. Introduction to graph databases (Neo4j) for representing entities (people) and edges (account on platform, follows, etc.). Decision criteria for storage: if focus is on relationships, graph DB might be ideal; if just aggregating attributes, SQL/NoSQL might suffice.
- Deliverables: Design a unified profile database schema and implement it in a chosen database. Populate it with example data: use data from Block 2 (tweets, one LinkedIn profile, etc.) and insert into the schema. For example, one table (or collection) could be Person, another for Account (with foreign key to Person), another for Posts. Deliverable is a brief schema documentation (illustrating how one person’s multi-platform data is represented) and a dump or screenshot of the database with some entries. Also include a script or ETL notebook that takes raw JSON from an API and inserts normalized records into the database.
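A minimal sketch of the Person/Account/Post schema described above, using SQLite for portability. Table and column names here are illustrative choices, not prescribed by the block:

```python
import sqlite3

# In-memory database standing in for PostgreSQL/SQLite in the deliverable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    full_name TEXT
);
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(id),
    platform TEXT,           -- e.g. 'twitter', 'linkedin'
    username TEXT,
    followers_count INTEGER
);
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES account(id),
    posted_at TEXT,          -- ISO-8601 strings sort correctly as text
    body TEXT
);
""")
conn.execute("INSERT INTO person (id, full_name) VALUES (1, 'John Doe')")
conn.execute(
    "INSERT INTO account (person_id, platform, username, followers_count) "
    "VALUES (1, 'twitter', 'johndoe', 420)"
)
# One person joined to their multi-platform accounts.
row = conn.execute(
    "SELECT p.full_name, a.platform, a.username "
    "FROM person p JOIN account a ON a.person_id = p.id"
).fetchone()
print(row)  # ('John Doe', 'twitter', 'johndoe')
```

Platform-specific fields that do not fit the shared columns can go in a JSON/TEXT overflow column or a key-value side table.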
Block 4 (Days 46–60): Attribute Matching and Identity Resolution
- Topics Covered: Core techniques for aligning accounts across platforms by attributes. Username correlation: Checking if the same username or slight variations exist on different sites (introduce tools like Sherlock, which checks hundreds of sites for a given username). Profile attributes: Matching real names, bio descriptions, profile photos, locations – for example, if two profiles share a unique name + city, likelihood they are same person. String similarity metrics (Levenshtein distance, Jaro-Winkler) for names/handles. Handling cases where users use different names: the role of external clues (e.g., linking via a common email or phone number if found). Introduction to the concept of User Identity Linkage in research: how multiple features (profile, content, network) can be combined.
- Deliverables: An identity matching script that takes two profile data (from the DB or JSON) and outputs a similarity score. Implement simple comparisons: e.g., same username = high score, else if real names are available, compute similarity, check if locations match, etc., then aggregate to an overall confidence score. Test this on a few known pairs (create dummy profiles that you know are same vs different). Additionally, run Sherlock (or similar username search tool) for a chosen username and provide the results as evidence of how one might find a person’s other accounts (for example, show that a unique username “alice123” was found on Twitter, Instagram, and a GitHub profile). Include those results in the deliverable.
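A minimal version of the matching script might look like the following. Here `difflib.SequenceMatcher` stands in for Levenshtein/Jaro-Winkler, and the weights and field names are arbitrary illustrative choices, not tuned values:

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Similarity ratio in [0, 1]; difflib stands in for Levenshtein here."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(p1, p2):
    """Weighted aggregate of attribute similarities (illustrative weights)."""
    signals = [
        (0.5, string_sim(p1.get("username"), p2.get("username"))),
        (0.3, string_sim(p1.get("name"), p2.get("name"))),
        (0.2, 1.0 if p1.get("location") and p1.get("location") == p2.get("location") else 0.0),
    ]
    return sum(weight * sim for weight, sim in signals)

# Dummy profiles: a/b are meant to be the same person, c a different one.
a = {"username": "alice123", "name": "Alice Smith", "location": "Austin"}
b = {"username": "alice123", "name": "A. Smith", "location": "Austin"}
c = {"username": "bob_m", "name": "Bob Miller", "location": "Denver"}

print(round(match_score(a, b), 2), round(match_score(a, c), 2))
```

In practice the weights would be calibrated on the known same/different test pairs the deliverable asks for, and a rarity signal (how unique the username is) would sharpen the score.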
Block 5 (Days 61–75): Image and Facial Matching Across Platforms
- Topics Covered: Using visual data for linking profiles. Many users reuse profile photos – so applying facial recognition or image hashing to profile pictures across accounts. Introduction to face recognition technology (OpenCV or Dlib with pretrained face embeddings) to compare profile pictures. Handling different image sizes or slight variations (the concept of perceptual hashing for images). Also consider avatars or non-face images (where reverse image search via services or libraries might help). Ethical considerations of facial recognition in OSINT. Additionally, exploring whether EXIF metadata from images can provide clues (most social sites strip EXIF on upload, but it is worth mentioning for completeness).
- Deliverables: A notebook or script that, given two images, computes a similarity (e.g., using a face recognition library to output an embedding and then distance between embeddings). Use it on a set of profile images: for example, take one person’s Twitter profile pic and their LinkedIn pic and see if the system deems them the same person. If actual images of the same individual are not available, use a public dataset of face images to simulate (or use your own images with consent). Deliverable includes the code and the results (e.g., distance scores) and a conclusion like “Images A and B likely the same person, Images A and C not the same.” Optionally, demonstrate use of a reverse image search API or tool on a profile picture and document if it finds the same picture on other sites.
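The perceptual-hashing idea mentioned above can be demonstrated without any imaging library: an average hash (aHash) thresholds coarse brightness structure, so resized or mildly edited copies of an image hash identically. A toy sketch, with 4×4 grayscale grids standing in for downscaled images (real use would go through Pillow/imagehash):

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values. Bit = 1 where pixel >= mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if v >= mean else 0 for v in flat)

def hamming(h1, h2):
    """Number of differing bits; small distance = perceptually similar."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

img_a = [[200, 200, 10, 10],
         [200, 200, 10, 10],
         [10, 10, 200, 200],
         [10, 10, 200, 200]]
# Slightly brightened, noisy copy of img_a: same coarse structure.
img_b = [[210, 205, 20, 15],
         [205, 210, 15, 20],
         [20, 15, 210, 205],
         [15, 20, 205, 210]]
# A different pattern entirely.
img_c = [[200, 10, 200, 10],
         [10, 200, 10, 200],
         [200, 10, 200, 10],
         [10, 200, 10, 200]]

print(hamming(average_hash(img_a), average_hash(img_b)))  # 0  (match)
print(hamming(average_hash(img_a), average_hash(img_c)))  # 8  (no match)
```

Face-embedding comparison in the deliverable proper works the same way at a higher level: compute a vector per face, then threshold a distance.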
Block 6 (Days 76–90): Behavioral Fingerprinting & Activity Analysis
- Topics Covered: Going beyond static profile info, use behavioral patterns to link accounts. Time-based analysis: e.g., if the same user posts on Twitter and Reddit, do they have similar active hours? Writing style analysis (stylometry): analyzing the text of posts or tweets for linguistic similarities (common phrases, punctuation, vocabulary breadth). Social network patterns: do two accounts follow or friend a similar set of other accounts (e.g., two LinkedIn profiles that have many mutual connections, or two Twitter accounts following the same niche influencers)? Introduction to basic stylometry tools or DIY analysis (e.g., using Python’s scikit-learn to vectorize text and compare). Also, consider device/browser fingerprints if available (rarely obtainable from public data, but worth mentioning how platforms detect sockpuppets via behavior). Research insight from the identity-linkage literature: “These features might include profile attributes, interaction patterns, and network connections” – highlighting multi-faceted linking.
- Deliverables: Analyze a sample of posts from two accounts (could be two accounts you suspect are the same person, or even two of your own accounts) for behavioral similarities. For instance, collect the last 50 tweets of UserA and the last 50 Reddit comments of UserB. Compute simple metrics: average post length, top 10 most common words (excluding stopwords), posting time distribution across 24h. Provide a comparison to see if they align. Alternatively, use a stylometry library (or write a small one) to produce a similarity score between the writing styles. The deliverable is a brief report with these findings and a judgment of whether the behavior indicates a likely match. Include graphs (like a histogram of posting times) if useful.
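The vocabulary-overlap and posting-time metrics can be sketched with the standard library alone. The posts, timestamps, and stopword list below are invented for illustration:

```python
import re
from collections import Counter
from datetime import datetime

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "on", "at", "for", "i"}

def top_words(posts, n=5):
    """Most common non-stopword tokens across (text, timestamp) pairs."""
    words = Counter()
    for text, _ in posts:
        words.update(w for w in re.findall(r"[a-z']+", text.lower())
                     if w not in STOPWORDS)
    return [w for w, _ in words.most_common(n)]

def hour_histogram(posts):
    """Posting-time distribution: hour of day -> post count."""
    return Counter(datetime.fromisoformat(ts).hour for _, ts in posts)

# Hypothetical (text, ISO timestamp) pairs standing in for fetched posts.
user_a = [("great match at the stadium tonight", "2024-05-01T22:10:00"),
          ("stadium food is underrated", "2024-05-02T23:05:00")]
user_b = [("anyone else at the stadium for tonight's match?", "2024-05-01T22:45:00")]

overlap = set(top_words(user_a)) & set(top_words(user_b))
print(overlap)                                   # shared vocabulary
print(hour_histogram(user_a), hour_histogram(user_b))  # both active late evening
```

With real data, a TF-IDF vectorizer plus cosine similarity (scikit-learn) replaces the raw word counts, and the 24-hour histograms can be compared with a simple correlation.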
Block 7 (Days 91–105): Public Records and External Data Enrichment
- Topics Covered: Enhancing social profiles with data from outside social media. Introduction to public record sources: voter registration databases, property records, corporate registries, academic publications – any sources that can tie to a person’s name or location. How to use search engines effectively (advanced queries to find a name in various contexts). Using data breach search services (Have I Been Pwned, Dehashed) to find emails linked to usernames, which might in turn link to real names. WHOIS data for personal websites or domains (if the person has a personal site). Phone number OSINT (using sites that check if a phone is in some registries). Automation tools: overview of frameworks like Spiderfoot which automate many OSINT enrichment steps. Emphasize verifying information and avoiding false positives.
- Deliverables: Perform an enrichment on a fictional or real case (perhaps yourself or a consenting colleague as the subject). Start with a social profile (e.g., LinkedIn of John Doe) and demonstrate how to find additional info: search for their name in a public records database (if accessible) or Google for their email/username. Use at least one automated tool (like running Spiderfoot with a person’s name or email as input) and collect interesting findings. Present the results as an “Enrichment summary”: e.g., “LinkedIn says John Doe works at X; found a public record of a business registration in John’s name in 2019 in California; found an email via a data leak search that matches his name, etc.” Include screenshots or outputs from any tool used as evidence, if possible.
Block 8 (Days 106–120): OSINT Tools & Pipelines Integration
- Topics Covered: Survey of popular OSINT tools that can be integrated into a pipeline. For social media specifically: Sherlock (username search), Maigret (similar multi-site search), Spiderfoot (automated OSINT with modules for social, domains, breaches), Maltego (visual link analysis with transforms for social links), and others like Recon-ng. Discuss how these tools can be combined or orchestrated. Building a pipeline means possibly calling these tools programmatically or using their APIs. Consider designing a workflow: e.g., input a username -> run Sherlock -> take found profiles -> scrape details -> run images through face matcher -> query public records, etc. Error handling and data management in such a pipeline.
- Deliverables: Develop a mini OSINT pipeline script that chains two or more tools/steps. For example: given a username, automatically run Sherlock (via command line) to find sites, then for one of the found profiles, fetch additional data via an API. Or, given a person’s name, search across multiple platforms (you could call Sherlock for name variants, and also search LinkedIn via Google). The deliverable is the code for this pipeline and a demonstration on a test case. Provide the output log or results obtained for the test case, showing how the pipeline automates multi-step data gathering. If using external tools, document how they were invoked (e.g., calling Sherlock’s CLI from Python).
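The pipeline structure itself can be kept very simple: a list of step functions that each read and extend a shared context dict. The steps below are stubs; in the real deliverable `find_accounts` might invoke Sherlock via `subprocess` and `fetch_profile` a platform API. All names and the stub data are invented:

```python
def find_accounts(ctx):
    # Stub for a username search step (e.g. Sherlock/Maigret output).
    ctx["accounts"] = [{"platform": "twitter", "username": ctx["target"]},
                       {"platform": "github", "username": ctx["target"]}]
    return ctx

def fetch_profile(ctx):
    # Stub for an API fetch of the first discovered account.
    first = ctx["accounts"][0]
    ctx["profile"] = {"platform": first["platform"], "bio": "(fetched bio)"}
    return ctx

def run_pipeline(target, steps):
    ctx = {"target": target, "log": []}
    for step in steps:
        ctx = step(ctx)
        ctx["log"].append(step.__name__)  # simple audit trail per step
    return ctx

result = run_pipeline("alice123", [find_accounts, fetch_profile])
print(result["log"])  # ['find_accounts', 'fetch_profile']
```

The audit log doubles as the "output log" the deliverable asks for, and swapping a stub for a real tool invocation does not change the runner.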
Block 9 (Days 121–135): Entity Resolution & Graph Building
- Topics Covered: Bringing all the pieces together to actually link accounts to a unified entity. Use of graph databases or networks where nodes represent identities or accounts and edges represent “possible same person” or connections (like one account follows another, or two accounts share an email). Developing an entity resolution algorithm: taking all the similarity scores and evidence from previous blocks to decide if accounts X and Y are the same person. Could use a rule-based approach (if above certain thresholds) or even a machine learning approach (train a classifier on known linked vs unlinked account pairs if data is available). Importance of a human-in-the-loop: flagging matches with confidence levels for a human analyst to review. Visualization of the merged identity graph, e.g., using a tool like Neo4j Bloom or Gephi to display a person with links to all their accounts and attributes.
- Deliverables: Build an Identity Graph for a test subject. For example, consolidate all data collected about “Person A” from previous exercises (Twitter, LinkedIn, maybe Instagram, plus any public record found) into a graph structure. Provide this either as a diagram (nodes labeled, edges drawn) or actually populate a Neo4j database and export a visualization. Accompany the graph with an explanation: e.g., “Node1 (Twitter JohnDoe) and Node2 (LinkedIn John A. Doe) are linked by edge ‘same_name_and_employer’ with weight 0.9 indicating high confidence they are same entity.” Essentially, demonstrate the resolution of separate data points into one entity. Deliverable can be a PDF or image of the graph plus a description of the linking decisions made.
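A rule-based core for this resolution step can be sketched as union-find over high-confidence edges: accounts are nodes, pairwise scores above a threshold become "same person" edges, and connected components are the resolved entities. The accounts and scores below are invented:

```python
def resolve(accounts, scores, threshold=0.8):
    """Cluster accounts whose pairwise score clears the threshold."""
    parent = {a: a for a in accounts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)      # union the two clusters

    clusters = {}
    for a in accounts:
        clusters.setdefault(find(a), []).append(a)
    return sorted(sorted(c) for c in clusters.values())

accounts = ["twitter:johndoe", "linkedin:john-a-doe", "reddit:jd_hiker"]
scores = {("twitter:johndoe", "linkedin:john-a-doe"): 0.9,  # same name + employer
          ("twitter:johndoe", "reddit:jd_hiker"): 0.4}      # weak evidence only
print(resolve(accounts, scores))
# [['linkedin:john-a-doe', 'twitter:johndoe'], ['reddit:jd_hiker']]
```

The same edge list, with its scores kept as weights, is what gets loaded into Neo4j or drawn in Gephi for the human-in-the-loop review.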
Block 10 (Days 136–150): Pipeline Integration & Automation
- Topics Covered: Finalizing the end-to-end pipeline for cross-platform profiling. This involves integrating extraction (Blocks 2–3), alignment (Blocks 4–6 and 9), and enrichment (Block 7) into one cohesive system. Addressing practical aspects: rate limiting (need to pause between API calls), data volume management (caching results to avoid re-fetching, since OSINT can produce a lot of data), and modular design (maybe using a workflow engine or just structured functions for each step). Ensuring the pipeline is well-documented and configurable (e.g., easy to switch target username or add a new platform module). Testing the pipeline on a fresh subject to refine any bugs.
- Deliverables: The Cross-Platform Profiling Pipeline codebase, delivered as a structured project (could be a Python package or a set of scripts). Include a README that explains how to use it (e.g., “provide a target username or name, then run main.py, which will output a consolidated profile report”). Run the pipeline for a sample subject (perhaps a known person with several social presences, or a fictional composite) and provide the final output produced – likely a consolidated profile or report. This output could be a JSON file with merged fields or a formatted report listing all identified accounts and info. Emphasize in the documentation how each module contributes (for instance, “module1: search usernames (uses Sherlock), module2: fetch Twitter data, module3: image compare, module4: compile results into graph”).
Block 11 (Days 151–165): Evaluation, Ethics, and Privacy
- Topics Covered: Evaluating the effectiveness of the profiling pipeline. Metrics could include: accuracy of linkages (if ground truth is known for test cases), completeness (how much info was gathered), and false positives (wrongly linking two different people). Plan a test using known linked accounts (maybe create a few dummy personas to test end-to-end). Discussion on ethical use: ensuring the pipeline is used on permissible data and for legitimate purposes. How to incorporate opt-out or data minimization if needed. Debrief on current laws (GDPR, CCPA) implications if any when aggregating public personal data. Also, discuss limitations: private accounts, data that cannot be accessed via OSINT, and the risk of errors.
- Deliverables: An evaluation report of the pipeline. This should include results of testing on a set of identities (could be the trainees themselves volunteering their real social accounts as ground truth). For example: “Out of 5 known accounts for Person X, the pipeline successfully linked 4, with 1 missed (because the username was completely different and no clues linked it). No false accounts were linked.” If quantitative metrics are possible, include them. Also, a section in the report on ethical guidelines for using such a profiling pipeline in a corporate or investigative context (perhaps 5-6 bullet points summarizing best practices and cautionary points). This document effectively wraps up the lessons learned and quality of the system.
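The quantitative part of this report reduces to treating linkage as binary classification over account pairs and computing precision (no false links) and recall (no missed links) against ground truth. A minimal sketch with invented pairs:

```python
def precision_recall(predicted, truth):
    """Precision/recall over sets of linked account pairs."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # correctly linked pairs
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Ground-truth links for the test personas vs what the pipeline produced.
truth = {("tw:a", "li:a"), ("tw:a", "rd:a"), ("tw:b", "li:b")}
predicted = {("tw:a", "li:a"), ("tw:b", "li:b"), ("tw:b", "rd:c")}  # one false link

p, r = precision_recall(predicted, truth)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

Low precision means the pipeline wrongly merges different people (the more damaging error in this context); low recall means clues were missed, e.g. a completely different username with no linking evidence.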
Block 12 (Days 166–180): Capstone Project – Unified Persona Investigation
- Topics Covered: The final project synthesizes the entire training: executing a full investigation on an individual or organization’s digital footprint using the developed skills and tools. Trainees treat it like a case study – for example, “Investigate the online presence of Jane Doe across platforms and open records.” This involves planning (which avenues to search), execution (running the pipeline and manual techniques), analysis of results, and presentation. Emphasis on clear documentation of findings and the confidence in each linkage. Also, this is an opportunity to refine any part of the pipeline or approach when applied to a complex real scenario.
- Deliverables: A comprehensive OSINT Profile Report on a chosen subject (could be a fictional composite if privacy is a concern, or a historical figure with plenty of public data). The report should consolidate information from multiple platforms and sources into a coherent profile: listing confirmed accounts, likely accounts (with confidence), key attributes (bio info, interests gleaned from posts, etc.), and any notable insights (for example, “this user’s Twitter indicates interest in X, which aligns with their Reddit activity on Y”). All statements should be backed by evidence (citations or screenshots in an appendix). The deliverable also includes an oral presentation or slide deck summarizing the approach and findings, simulating an internal briefing. Finally, the complete code and documentation of the profiling pipeline should be submitted, reflecting any improvements made during the capstone. This final delivery demonstrates the trainee’s ability to conduct thorough cross-platform OSINT investigations and present the intelligence gathered.
Full Tech Stack & Tools:
- Programming & Scripting: Python as the primary language (for using APIs, web scraping, data manipulation). Key libraries: Requests/BeautifulSoup for scraping HTML, Selenium for dynamic pages (if needed), Tweepy for Twitter API, PRAW for Reddit API, and any relevant SDKs for other platforms.
- OSINT Tools Integration: Sherlock (username search automation), Spiderfoot (for automated multi-source scans), and possibly Maltego (for visual link analysis, though in our pipeline we might use Neo4j instead). Using these via CLI or their Python APIs if available.
- Databases: PostgreSQL or SQLite for storing profile data; Neo4j for graph-based identity resolution if utilized. ElasticSearch if full-text search across collected data is needed (optional).
- Data Analysis: Pandas for cleaning and combining data; simple NLP or stylometry tools (NLTK, scikit-learn for vectorizing text) for behavioral analysis. OpenCV/Dlib or the face_recognition library for image comparison of faces.
- External APIs & Services: Various platform APIs (Twitter, Reddit, etc.); the Have I Been Pwned API for breach checks; Google Custom Search or Bing Search API for general web searches; Clearbit or FullContact APIs (if available for email/person lookups, though they are commercial).
- Workflow & Automation: Jupyter notebooks for development and testing of techniques; then organizing code into Python modules for the integrated pipeline. Possibly Airflow or simple scheduling for running multi-step processes (if simulating continuous monitoring, though not required).
- Documentation & Reporting: Use of Markdown and Jupyter for documenting steps; graph visualization either via Neo4j Bloom or converting the graph to a visual format (networkx with matplotlib, or Graphviz). Standard office tools for compiling final reports (or LaTeX if producing a polished PDF report).
- References & Learning Resources: Academic research on user identity linkage (to justify methods), OSINT community guides and forums (e.g., Bellingcat’s online investigation toolkit), official docs of platforms for what data is accessible, and ethical guidelines (e.g., intel techniques that are open source and legal). GitHub repositories of popular OSINT tools (Sherlock, Spiderfoot, Recon-ng) to understand their usage and incorporate ideas.