A simplified ETL process on Hadoop using Apache Spark, providing a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
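As a rough illustration of such a pipeline (not the repository's actual API), here is a minimal extract/transform/load sketch in Spark; the paths, column names, and aggregation are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EtlSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real data-lake job would run on a cluster.
    val spark = SparkSession.builder()
      .appName("etl-sketch")
      .master("local[*]")
      .getOrCreate()

    // Extract: read raw events (path and schema are hypothetical).
    val raw = spark.read.option("header", "true").csv("/datalake/raw/events.csv")

    // Transform: validate and reshape; drop rows missing a required column.
    val cleaned = raw
      .filter(col("user_id").isNotNull)
      .withColumn("event_date", to_date(col("event_ts")))
      .groupBy("event_date")
      .agg(count("*").as("events"))

    // Load: write a partitioned Parquet table back to the lake.
    cleaned.write.mode("overwrite").partitionBy("event_date")
      .parquet("/datalake/curated/daily_event_counts")

    spark.stop()
  }
}
```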
Simulates data transfer to explore the caching potential of network nodes running Hadoop over NDN (Named Data Networking) rather than traditional TCP/IP.
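The idea NDN adds is that intermediate nodes cache named data packets, so repeated requests for the same block can be answered mid-path instead of end-to-end. A toy single-process sketch of that effect, with a hypothetical request trace, path length, and cache size (not the project's actual simulator):

```scala
import scala.collection.mutable

object NdnCacheSim {
  // A node with a tiny LRU cache of named data blocks (capacity is hypothetical).
  final class Node(capacity: Int) {
    private val cache = mutable.LinkedHashSet.empty[String]
    def has(name: String): Boolean = {
      val hit = cache.remove(name)
      if (hit) cache.add(name) // refresh LRU position on a hit
      hit
    }
    def store(name: String): Unit = {
      if (cache.size >= capacity) cache.remove(cache.head) // evict oldest entry
      cache.add(name)
    }
  }

  def main(args: Array[String]): Unit = {
    val path = Vector.fill(4)(new Node(capacity = 2)) // consumer -> ... -> producer
    // Hypothetical trace: HDFS-style block names, with repeats across consumers.
    val requests = Seq("blk_1", "blk_2", "blk_1", "blk_3", "blk_1", "blk_2")

    var ndnHops = 0
    val tcpHops = requests.size * path.size // TCP/IP baseline: always end-to-end

    for (name <- requests) {
      val idx = path.indexWhere(_.has(name)) // nearest cached copy, if any
      val hops = if (idx >= 0) idx + 1 else path.size
      ndnHops += hops
      path.take(hops).foreach(_.store(name)) // data is cached on the way back
    }
    println(s"hops with NDN caching: $ndnHops vs TCP/IP: $tcpHops")
  }
}
```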
A simple Hadoop-like distributed computing platform implemented in Java. [This is a course project at UIUC (awarded the best Java implementation) and is open-sourced for reference.]
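For orientation, this is roughly the MapReduce-style contract such a platform exposes to user jobs; the trait names and the single-process runner below are hypothetical, not the project's actual interfaces:

```scala
// A minimal sketch of the MapReduce-style programming model a Hadoop-like
// platform typically offers; names here are illustrative only.
trait MapReduceJob[K, V] {
  def map(line: String): Seq[(K, V)]
  def reduce(key: K, values: Seq[V]): V
}

// Classic word count expressed against that contract.
object WordCount extends MapReduceJob[String, Int] {
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  def reduce(key: String, values: Seq[Int]): Int = values.sum
}

object LocalRunner {
  // Single-process stand-in for the distributed shuffle: group by key, then reduce.
  def run[K, V](job: MapReduceJob[K, V], input: Seq[String]): Map[K, V] =
    input.flatMap(job.map)
      .groupBy(_._1)
      .map { case (k, kvs) => k -> job.reduce(k, kvs.map(_._2)) }

  def main(args: Array[String]): Unit =
    println(run(WordCount, Seq("to be or not to be")))
}
```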
Fork of http://asterixdb.ics.uci.edu/fuzzyjoin/, implementing "Efficient Parallel Set-Similarity Joins Using MapReduce" (Rares Vernica, Michael J. Carey, and Chen Li; SIGMOD 2010).
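The core of the paper is prefix filtering: records are replicated to reducers keyed by their rarest (prefix) tokens, so only pairs that share a prefix token are ever verified. A compact single-machine sketch of that idea for Jaccard similarity; the records and threshold are made up:

```scala
object FuzzyJoinSketch {
  def jaccard(a: Set[String], b: Set[String]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  def main(args: Array[String]): Unit = {
    val threshold = 0.5 // hypothetical Jaccard threshold
    val records: Map[Int, Set[String]] = Map(
      1 -> Set("apache", "hadoop", "mapreduce"),
      2 -> Set("apache", "spark", "mapreduce"),
      3 -> Set("named", "data", "networking")
    )

    // First stage of the paper orders tokens by global frequency (rare first),
    // so that record prefixes are selective.
    val freq: Map[String, Int] =
      records.values.flatten.toSeq.groupBy(identity).map { case (t, ts) => t -> ts.size }

    // Prefix length for Jaccard threshold t: |x| - ceil(t * |x|) + 1 tokens.
    def prefix(tokens: Set[String]): Seq[String] = {
      val sorted = tokens.toSeq.sortBy(t => (freq(t), t))
      sorted.take(tokens.size - math.ceil(tokens.size * threshold).toInt + 1)
    }

    // "Map" phase: emit (prefixToken, recordId); grouping plays the shuffle.
    val byToken: Map[String, Seq[Int]] = records.toSeq
      .flatMap { case (id, toks) => prefix(toks).map(t => (t, id)) }
      .groupBy(_._1)
      .map { case (t, pairs) => t -> pairs.map(_._2) }

    // "Reduce" phase: verify only candidate pairs that met under some token.
    val matches = byToken.values.toSeq.flatMap { ids =>
      ids.sorted.combinations(2).collect {
        case Seq(a, b) if jaccard(records(a), records(b)) >= threshold => (a, b)
      }
    }.distinct

    println(matches) // List((1,2)) with this toy data
  }
}
```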