Detection of Web API Content Scraping: An Empirical Study of Machine Learning Algorithms

Dina Jawad

Institution:	KTH
Department:
Year:	2017
Keywords:	web API; scraping; machine learning; supervised learning; web API content scraping; Computer Sciences
Posted:	02/01/2018
Record ID:	2195868
Full text PDF:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210235

Abstract

Scraping is known to be difficult to detect and prevent, especially in the context of web APIs. Organisations that rely heavily on the content they provide through their web APIs have an interest in protecting that content from scrapers. This thesis proposes a machine learning approach to detecting web API content scrapers. Three supervised machine learning algorithms were evaluated to see which would perform best on data from Spotify's web API. The data used to evaluate the classifiers consisted of aggregated HTTP request data describing each application that sent HTTP requests to the web API over a span of two weeks. Two separate experiments were performed for each classifier. In the second experiment, the original dataset was augmented with synthetic data for scrapers (the minority class), generated by oversampling with SMOTE. The results show that Random Forest was the best classifier, with a Matthews correlation coefficient (MCC) of 0.692, obtained in the first experiment without synthetic data. For this particular problem, it is crucial that the classifier does not have a high false positive rate, as legitimate usage of the web API should not be blocked. The Random Forest classifier has a low false positive rate and is therefore considered the strongest and most reliable of the three classifiers examined.
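The two evaluation measures named in the abstract, MCC and the false positive rate, can both be computed from the counts of a binary confusion matrix. The following is a minimal sketch in Python, not the thesis's own code: the function names and the toy labels (1 for a scraper, 0 for a legitimate application) are assumptions made up for illustration and are not data or results from the thesis.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the scraper class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def false_positive_rate(fp, tn):
    """Share of legitimate applications wrongly flagged as scrapers."""
    total_negatives = fp + tn
    return fp / total_negatives if total_negatives else 0.0


# Hypothetical, imbalanced toy labels: 16 legitimate apps, 4 scrapers.
y_true = [0] * 16 + [1] * 4
y_pred = [0] * 15 + [1] + [1, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(mcc(tp, fp, tn, fn))          # 0.6875
print(false_positive_rate(fp, tn))  # 0.0625
```

On this toy data the accuracy is 0.9, yet a classifier that never flags anything would reach 0.8 while scoring an MCC of 0. MCC uses all four confusion-matrix counts, which is one reason it suits a dataset where scrapers are the minority class.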