E-Mail Address Harvesting

Unsolicited bulk e-mail (spam), whose volume has exceeded that of legitimate e-mail, remains a costly economic problem. Despite existing countermeasures, spam campaigns advertising products are profitable even when the number of purchases is small relative to the amount of spam sent. The apparent success of these campaigns motivates studying spamming trends and their economics, which may provide insights into more efficient countermeasures.

Few studies address the origins of the spamming process, e.g., address harvesting, which remains the primary means for spammers to obtain new target addresses. Addresses can be harvested in multiple ways, e.g., from public web pages by crawlers or by malicious software running locally on compromised machines.

To explore the origins of the spamming process, this paper conducts a large-scale study of addresses harvested from public web pages. Our study thus provides an up-to-date view of spam origins and yields guidelines for webmasters to protect e-mail addresses.

Main Findings

Our main findings can be summarized as follows:

  • Search engines are used as proxies, either to hide the identity of the harvester or to optimize the harvesting process.
  • Simple obfuscation methods, e.g., emailid [at] domain [dot] tld, are still effective at protecting addresses from being harvested.
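
The obfuscation pattern named in the second finding can be illustrated with a minimal Python sketch. The function name obfuscate and the choice to rewrite only the "@" and the dots in the domain part are assumptions made for illustration; the study itself does not prescribe an implementation.

```python
def obfuscate(address: str) -> str:
    """Rewrite 'user@example.org' as 'user [at] example [dot] org'.

    Illustrative helper (hypothetical name); it mirrors the
    'emailid [at] domain [dot] tld' pattern and nothing more.
    """
    # Split at the last '@' so the domain part is isolated.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an e-mail address: {address!r}")
    # Assumption: only dots in the domain are replaced; the local part is kept as-is.
    return f"{local} [at] {domain.replace('.', ' [dot] ')}"


if __name__ == "__main__":
    print(obfuscate("emailid@domain.tld"))
```

A page template would call such a helper when rendering contact details, so that the literal address never appears in the delivered HTML.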

Other Findings

  • Harvesting on our web sites is on the decline.
  • Addresses that were harvested from our pages are mainly spammed in batches and are only used for a short time period.
  • Harvester bots are still mainly run in (residential) access networks.
  • One interpretation of our results is that only a few parties are involved in address harvesting, each causing different spam volumes.
  • Our findings also suggest that the usage of some harvesting software is stable.
  • Harvesters make little use of Tor as an anonymity service to hide their identity.

Data Set

To identify address harvesting crawlers, we have embedded more than 23 million unique spamtrap addresses in more than 3 million visits to web pages over the course of more than three years, starting in May of 2009.
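
How a unique spamtrap address can tie later spam back to the page visit that exposed it may be sketched as follows. This is not the setup used in the study, whose address generation is not described here. The class SpamtrapIssuer, the Visit record, the random-token scheme, and the domain trap.example are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Visit:
    """What was known about the request that received a spamtrap address."""
    client_ip: str
    user_agent: str
    page: str
    timestamp: float


class SpamtrapIssuer:
    """Hands out one fresh address per page visit and remembers the visit.

    Hypothetical sketch: a real deployment would persist the mapping
    instead of keeping it in memory.
    """

    def __init__(self, domain: str = "trap.example") -> None:
        self.domain = domain
        self._issued: Dict[str, Visit] = {}

    def issue(self, client_ip: str, user_agent: str, page: str) -> str:
        # Draw random tokens until one is unused, so every address is unique.
        while True:
            address = f"{secrets.token_hex(8)}@{self.domain}"
            if address not in self._issued:
                break
        self._issued[address] = Visit(client_ip, user_agent, page, time.time())
        return address

    def lookup(self, address: str) -> Optional[Visit]:
        # Spam sent to an issued address identifies the visit that harvested it.
        return self._issued.get(address.lower())


if __name__ == "__main__":
    issuer = SpamtrapIssuer()
    addr = issuer.issue("192.0.2.1", "ExampleBot/1.0", "/contact.html")
    print(addr, issuer.lookup(addr))
```

Because each address is shown exactly once, any message arriving at it can be attributed to a single request, including the requesting IP address and user agent.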

We provide the data set in partly anonymized form for you to verify and extend our analysis.



This data set is made available under the Open Database License. Any rights in individual contents of the database are licensed under the Database Contents License.

Related Publications

Oliver Hohlfeld, Thomas Graf, and Florin Ciucu (2012). Longtime Behavior of Harvesting Spam Bots. ACM Internet Measurement Conference (IMC).

Oliver Hohlfeld
Communication and Distributed Systems
RWTH Aachen University
Ahornstr. 55
52074 Aachen, Germany
Email: oliver at comsys /DOT/ rwth-aachen.de
