Scraping Reddit API using JavaScript
In the previous article, I discussed how to plot live data from Twitter using Highcharts. In this article, we will see how Reddit can be scraped using only JavaScript, without any authorization mechanism. It is simple, and since we fetch the data entirely from the front end, we don't need a local server either.
The following are some of the Reddit API URLs; you can look for more details here. The newer API requires OAuth, which means you would need a local server for it. We will use the simpler, unauthenticated version instead.
For Comment: https://www.reddit.com/r/all/comments/.json?limit=100
For Query with Time Duration: https://www.reddit.com/search.json?sort=new&q=(and%20%22boston%22%20(and+timestamp%3A1420070400..1425168000))&restrict_sr=on&syntax=cloudsearch
Reddit returns at most 100 items per request, and you can use the limit parameter to control how many. The sort parameter is also very useful: with sort=new we get the latest posts.
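As a rough sketch of how such a request URL might be assembled and its response unpacked: the helper names `buildSearchUrl` and `extractPosts` are hypothetical, and the code assumes the usual listing shape in which the posts sit under `data.children`.

```javascript
// Hypothetical helper: build a search.json URL with limit and sort parameters.
function buildSearchUrl(query, limit, sort) {
  const params = new URLSearchParams({
    q: query,
    sort: sort,                          // e.g. "new" for the latest posts
    limit: String(Math.min(limit, 100))  // never ask for more than 100 items
  });
  return 'https://www.reddit.com/search.json?' + params.toString();
}

// Hypothetical helper: pull the array of posts out of a listing response.
// Assumes the shape { data: { children: [ { data: {...} }, ... ] } }.
function extractPosts(listing) {
  return (listing && listing.data && listing.data.children) || [];
}

// Possible usage in the browser (not executed here):
// fetch(buildSearchUrl('boston', 100, 'new'))
//   .then(function (response) { return response.json(); })
//   .then(function (json) { var data = extractPosts(json); /* ... */ });
```

Keeping the URL building separate from the fetch call makes it easy to change the query or the sort order later without touching the rest of the code.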
We will use the query URL and fetch results every five seconds, so that new data keeps being added to the graph.
We will plot the graph by minute, so we need to cut off the seconds from the time.
// By minute: remove the seconds from the full date/time string
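The five-second refresh could be sketched with `setInterval` as follows. The name `startPolling` and the `task` callback are assumptions made for illustration; `task` stands for whatever function fetches the data and redraws the graph.

```javascript
// Hypothetical helper: run `task` immediately, then again every `intervalMs`.
// Returns a function that stops the polling.
function startPolling(task, intervalMs) {
  task();                                   // first fetch without waiting
  const timer = setInterval(task, intervalMs);
  return function stop() {
    clearInterval(timer);
  };
}

// Possible usage (not executed here):
// var stop = startPolling(fetchAndPlot, 5000);  // fetchAndPlot is hypothetical
// ...later: stop();
```

Returning a stop function is a small design choice that makes it possible to pause the updates, for example when the user leaves the page.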
d = (new Date(data[i].data.created_utc * 1000));
d = d.toString();
d = d.replace(/:\d\d([ ap]|$)/,'$1');
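An alternative to editing the date string, offered only as a sketch, is to round the timestamp down arithmetically. The helper name `floorToMinute` is hypothetical; it assumes `created_utc` is given in seconds since the epoch, as in the snippet above.

```javascript
// Hypothetical helper: round a created_utc value (seconds since the epoch)
// down to the start of its minute, returned in milliseconds for charting.
function floorToMinute(createdUtcSeconds) {
  return Math.floor(createdUtcSeconds / 60) * 60 * 1000;
}

// Example: 32 seconds past the minute is rounded down to the minute itself.
// floorToMinute(1420070432) → 1420070400000
```

This avoids depending on the exact text that `toString()` produces for a date.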
One problem here is that fetching every five seconds is likely to return posts we have already seen, so we use a Set to keep only one entry per unique post ID.
if (!set.has(data[i].data.id)) {
  set.add(data[i].data.id);
  // Push a count of 1 for each Reddit record's date
  mydata.push([Date.parse(new Date(d)), 1]);
}
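The same idea can be wrapped in a small function so that it is easy to try with sample data. This is a sketch under assumptions: `addNewPosts` is a hypothetical name, and each post is assumed to carry `data.id` and `data.created_utc` as in the snippets above.

```javascript
// Hypothetical helper: append a [timestamp, 1] point for every post not seen before.
// `seen` is a Set of post IDs that persists across polls.
function addNewPosts(posts, seen, points) {
  posts.forEach(function (post) {
    if (seen.has(post.data.id)) return;  // already counted in an earlier poll
    seen.add(post.data.id);
    // Round created_utc (seconds) down to the minute, in milliseconds.
    points.push([Math.floor(post.data.created_utc / 60) * 60000, 1]);
  });
  return points;
}

// Two polls that overlap on post "b": it is only counted once.
var seen = new Set();
var points = [];
addNewPosts([{ data: { id: 'a', created_utc: 60 } },
             { data: { id: 'b', created_utc: 75 } }], seen, points);
addNewPosts([{ data: { id: 'b', created_utc: 75 } },
             { data: { id: 'c', created_utc: 130 } }], seen, points);
// points → [[60000, 1], [60000, 1], [120000, 1]]
```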
Further, since we get multiple values for each date, we have to group them together by date. We also have to sort them by time so that the chart does not behave abnormally.
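// Transformation of data
// Merge entries that share a timestamp by adding their counts to the first occurrence
function sum(a) {
  return a.filter(([key, num], idx) => {
    var first = a.findIndex(([key2]) => key === key2);
    if (first === idx) return true;  // first occurrence: keep it
    a[first][1] += num;              // duplicate: add its count to the first one
    return false;                    // and drop the duplicate
  });
}

// Comparator for sorting [timestamp, count] pairs by ascending timestamp
function sortFunction(a, b) {
  if (a[0] === b[0]) {
    return 0;
  }
  return (a[0] < b[0]) ? -1 : 1;
}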
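For comparison, grouping and sorting can also be sketched with a `Map`, which totals the counts in a single pass. The name `groupAndSort` is hypothetical, and the input is assumed to be an array of `[timestamp, count]` pairs.

```javascript
// Hypothetical helper: total the counts per timestamp, then sort by time.
// The input array is left untouched.
function groupAndSort(points) {
  const totals = new Map();
  points.forEach(function (point) {
    const time = point[0];
    totals.set(time, (totals.get(time) || 0) + point[1]);
  });
  return Array.from(totals.entries()).sort(function (a, b) {
    return a[0] - b[0];
  });
}

// groupAndSort([[120000, 1], [60000, 1], [120000, 1]])
// → [[60000, 1], [120000, 2]]
```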
Now we plot the graph, using the Highcharts library for visualization.
// Create the chart
Highcharts.stockChart('container', {
  series: [{
    name: 'No. of Reddits per given time',
    data: mydata_grouped
  }]
});
We are not updating the graph through the chart's load event; instead, we re-plot it with the newer data every time.
Again, we have not stored the data; you can modify the current version to store it in a CSV file or a database.
The full code is available on GitHub.
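As one possible starting point for the CSV option, the grouped points could be turned into CSV text like this. The helper `toCsv` and the column names `time` and `count` are assumptions for illustration only.

```javascript
// Hypothetical helper: turn [timestamp, count] pairs into CSV text.
// Timestamps are in milliseconds and are written out in ISO 8601 form.
function toCsv(points) {
  const rows = points.map(function (point) {
    return new Date(point[0]).toISOString() + ',' + point[1];
  });
  return ['time,count'].concat(rows).join('\n');
}

// toCsv([[1420070400000, 3]])
// → "time,count\n2015-01-01T00:00:00.000Z,3"
```

The resulting string could then be offered as a download or sent to a backend for storage.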