No Cost Data Scraping With GitHub Actions And Neo4j Aura
Working with data from the Lobsters social news site
In this livestream we'll take a look at using GitHub Actions to periodically scrape data and import it into Neo4j Aura.
Links And Resources
Hi everyone, welcome to the Neo4j stream. My name is Will. I know it's been a little bit since I've been on the stream, but we're getting back into it today, so thanks for joining. Today is going to be pretty fun: we're going to take a look at importing data into Neo4j using GitHub Actions. There's a really neat workflow that I discovered in the last couple of weeks using GitHub Actions, and specifically a new action that the GitHub team released a couple of weeks ago called Flat Data. So we'll talk a little bit about what GitHub Actions are, how we can use this Flat Data action to schedule periodic data imports, or data scraping, from a URL or from another database, and then we're going to use Neo4j Aura to import that data into a Neo4j instance in the cloud.

The dataset we're going to look at is the Lobsters dataset. Lobsters (let me zoom in a little bit) is a social news aggregator, kind of like Hacker News if you're familiar with that: you can submit articles, they get voted up and down, and there's a front page with the hottest, most popular articles that folks are interested in. There are comments and comment threads, and each article has tags as well; for example, this is an article about databases that is tagged "scaling", and here we can see the other tags. This is a really interesting dataset to me because I've been thinking about how we surface relevant content to users based on their interests, and whether we could use this as a sort of news recommendation dataset.

One thing that's really nice about Lobsters is that they publish a couple of JSON endpoints. Let me drop a link to Lobsters in the chat so you can see what we're talking about: there's the link to the Lobsters site, and then Lobsters has a "newest" JSON feed, a JSON endpoint with the newest articles submitted. These are not necessarily the ones on the front page; articles go to the newest feed first, users can then choose to upvote, downvote, or comment on them, and an algorithm that basically looks at what has earned a lot of points over a short amount of time decides what surfaces to the top and makes the front page. Anyway, we're going to use this Lobsters newest JSON feed, and I want to import this JSON file periodically, because it's changing; it's not static. As of right now these are the newest articles, but as people submit new articles we want to capture those too. So I want to set up a system that captures all the new articles being submitted and loads them into my Neo4j Aura database, so that I can do some analysis: look at the structure of this sort of social network, look at which articles are interesting, and maybe figure out, based on the articles you've submitted, other articles you might be interested in, that kind of thing. But really we could do this with any URL using GitHub Actions.

Cool. Before we jump into GitHub Actions, let's start off by setting up our Neo4j Aura instance. If you're not familiar with Neo4j Aura, it's Neo4j's managed database-as-a-service: we can spin up Neo4j Aura instances in the cloud and scale them up and down, we get a cluster depending on the workload that we need, and automatic backups, monitoring, and upgrades are all handled for us.
So that's Neo4j Aura. There is also a free tier in Neo4j Aura, which means we can spin up an Aura instance without having to put in a credit card and we get a free-forever database. This is great for side projects; like a lot of folks I have a lot of side projects going at the same time, so it's great to have a database in the cloud for them. I'm already signed in to Neo4j Aura, so I'm just going to create a database. It gives me two options, Aura Free or Aura Professional. Professional gives me a Neo4j cluster with multiple instances whose size I can scale up and down, but we're interested in the Aura Free tier, since this is going to be a side project. And the theme I wanted to get across in the stream today is really no-cost data scraping: when we get to GitHub Actions we'll see that GitHub Actions also has a free tier, so as long as we stay within that threshold we can scrape this data from the Lobsters news aggregator, import it into Neo4j, do some analysis, build a web application, whatever we end up doing with it, all at no cost, which I think is pretty cool.

OK, we're going to choose Aura Free and give it a name; let's call it something like "lobsters-graph". It shows me the generated password, so I'm going to copy that and save it somewhere else because we'll need it in a minute, confirm that I have saved it, and now our Aura instance is provisioning. That will take a minute or two to spin up, so while we're waiting let's talk about GitHub Actions.

GitHub Actions is a feature of GitHub that lets us trigger some code to run, typically in response to a GitHub event. If I have a repository on GitHub and I make a commit, or someone opens a PR, we typically use GitHub Actions to do something like run the tests or trigger a deployment. I can write my own code in GitHub Actions to define what I want to happen, or I can use one of the many pre-existing actions from the marketplace; there are a whole bunch of GitHub Actions that other folks have built and published that we can leverage. We configure all of this using YAML. Today we're specifically going to use a GitHub Action called Flat Data. And by the way, let me drop some links here in the chat: that's GitHub Actions, here's Flat Data, and here's the link for Aura as well.

OK, so what is Flat Data? There's a really interesting workflow that Simon Willison, who you may remember from the Django project, uncovered. Currently he's been working on a project called Datasette, which allows you to share datasets and build APIs using SQLite, which is pretty cool, but he also came up with this thing he calls git scraping. The idea is: if I have some URL whose data changes over time, I can write a scraper that fetches the data from that URL and checks it in to version control, so checks it in to git, and then I can look at the diffs to see how that data changed over time. In the example in his blog post, which I'll also drop a link to in the chat, he was looking at fire data: as there's a new fire, a new incident, the data changes, and he wanted to see how it changed over time. And there's a team at GitHub, in the Office of the CTO, called OCTO, which I think is a very fun name.
They took this idea and built an action that's now available in the GitHub Action Marketplace called Flat Data. This is really cool, because Flat Data lets us basically just give a URL for a file, in our case the URL for the JSON file we're going to be working with (it also works with a database: if I want to run a query against a database and download the results, I can do that with Flat Data too), and with just a bit of configuration say: fetch this every hour, or every five minutes, whatever; check the file into version control, so check it into git and push it up to GitHub; and then I can look at the diff to see how that file changes over time. There are some other things in here as well: we can do some post-processing, and we're going to use a feature like that to import the data into Neo4j every time we fetch it, and there's also a visualization component that the team built.

Cool, so that is the plan. Let's start with the Flat Data scraping; that seems like a good place to start. I'm going to go to GitHub and create a new repository, and by the way, if anyone has any questions or wants to chat, feel free to drop a message in the chat; I can see those on the other screen here. So let's create a new repository. This is going to be the repository where we define our GitHub Action and where the data for the Lobsters articles gets checked in. Let's call it "lobsters-graph", with the description "importing data into Neo4j Aura using GitHub Actions, specifically data from Lobsters". Make it public, and let's add a README so we can start committing to it right away. I'm going to try to use just the embedded editor on GitHub; we'll see how that goes.

OK, should we look at an example first? Here's an example for Flat Data that imports Bitcoin price data, and we can see this .github/workflows directory that has a flat.yml file; let me zoom in on that and move it over here. So here's the example: this YAML defines an action, with the triggers for the action and then the steps, which are basically just to use the GitHub Flat action, fetch this URL, save it as this file, and then run a post-process step, a JavaScript file that says: after you fetch this data, here's what to run.

So let's create that for ours. I'm just going to use the editor on GitHub. Create a new file, and we want this to live in .github/workflows; let's call it lobsters.yml. We need to give the workflow a name; let's call it "lobsters data import". Now we define the triggers for this action: any time there's a push, or I guess specifically any time there's a push to this configuration file, so any time we're changing the configuration for the action, let's run it. So: any time there is a push to any of these paths, and that's just going to be .github/workflows/lobsters.yml. We can also allow manually triggering it: if we add workflow_dispatch, that lets us go into the UI and click a button to trigger the workflow manually. And then here's the really cool thing: we can put this on a cron schedule. So we can say run it every, maybe let's do every hour, that's probably often enough.
That, I think, is the crontab syntax for every hour. The most frequent we can run these on GitHub is, I think, every five minutes, and by default with GitHub you get something like 3,000 minutes of GitHub Actions on your account for free; of course you can pay to go beyond that. So we want to keep our total minutes of action running under that free threshold per month, unless we have a reason to increase the frequency. I have a bunch of other things running, but I think once every hour should be fine.

OK, now we define the jobs for this action. Let's call the job "scheduled", and we can define the container it runs on; let's just run this on the latest Ubuntu. Then we define the steps for the action. The first thing we'll do is check out the repo: this uses a specific action from the marketplace called checkout, and that just checks out the repo, because we want to be able to access the data, commit the data back, and so on. Then let's call the next step "Fetch newest" (let me zoom in a little bit so this is hopefully more readable). Fetch newest is going to use the githubocto Flat Data action, and in its "with" block we define some variables: we need to specify the http_url, the file we want to download every hour, which is https://lobste.rs/newest.json, and then we need to say: save that as newest.json. OK, I think that should be enough to fetch this file with Flat Data, so let's commit it. If we go to the Actions tab we can see, yep, here's the action we created, and it's starting to run; the first time it runs it should fetch our file and check it into this repository. And if we go back — cool, yep, looks like it did: newest.json is now checked in to our repo.
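For reference, the fetch-only workflow described above comes together as roughly the following `.github/workflows/lobsters.yml`. This is a sketch rather than the exact file from the stream: the action versions (`actions/checkout`, `githubocto/flat`) are pinned to what was current around this time and are worth checking against the marketplace listings.

```yaml
name: lobsters data import

on:
  push:
    paths:
      - .github/workflows/lobsters.yml  # re-run whenever the workflow config changes
  workflow_dispatch: {}                 # allow manual runs from the Actions tab
  schedule:
    - cron: '0 * * * *'                 # top of every hour

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v2
      - name: Fetch newest
        uses: githubocto/flat@v3
        with:
          http_url: https://lobste.rs/newest.json  # the Lobsters "newest" JSON feed
          downloaded_filename: newest.json         # committed back to the repo on each run
```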
OK, so now the next thing we want to do is think about how we're going to get this data into Neo4j. Let's jump over to Neo4j Aura; our Aura instance is ready to go, so let's open Neo4j Browser. First, let's take this file and think about it: we have this JSON, how are we going to import it into Neo4j? Let's see how we can use APOC, the standard library for Neo4j, to pull this JSON file into Neo4j and create data in the graph.

One thing that interests me about Lobsters is that it has an invite graph; let's see if we can find it. On Lobsters you have to be invited to be able to submit stories and comment, and that invite tree is public: everyone who has invited users, the full user tree. So we can see everyone who has been invited to Lobsters, who invited them, and so on. This invite graph, I think, makes Lobsters a bit more of a friendly community, since basically everyone knows everyone else to some degree, and it would be an interesting graph to pull in; maybe we'll look at that in the next session. That is what this invited_by_user field is here: it says this user, jordigh, was invited to Lobsters by this other user, so we can pull that in as well. And then each article has some tags.

So let's go through the graph modeling process and think about this. We're going to have a user, and the user has a username, which is a string, and a created-at value (let me zoom in a bit, hopefully that's big enough), which is going to be a datetime. There's an is_admin flag, which I don't know if we need; let's bring in the about field, which is a string. A user has karma: you earn karma as you post articles that other users vote up and as other users upvote your comments, and that's an integer. We also have an avatar URL. And then we said that a user is invited by another user, so we'll model that as a User INVITED_BY User relationship.

The other entity we have is the article, so let's say a user SUBMITTED an article. What does the article have? It has an ID, called short_id — we'll keep the Lobsters term, so short_id — a created-at, which is a datetime, a URL, which is important, and a score, which is important and is an integer. And then each article has one or more tags, so let's model those too. A tag is like a category for the article: in this one, an article about prime numbers, the tags are "games" and "math". We're going to use tags as a way to figure out what an article is about, and later on, if we want to build recommendations with this data, those tags are going to be really useful for figuring out what users are interested in. I guess we'll just say that a tag has a name, and then an article HAS_TAG tag. Cool, so here's the graph model we want to create based on the data that we have.

Let's jump back into Neo4j Browser for Aura — oh, I copied the wrong password; let me grab the right one. OK, so we'll call apoc.load.json; the URL is https://lobste.rs/newest.json, and we're going to say YIELD value, so value is now the object representing the parsed JSON, and let's just return that to make sure it works. We should get back, what is it, 25 rows — yeah, we get back 25 records of parsed JSON. Let me slide this over here so we can zoom in a bit more and the font's a bit bigger; maybe that's easier. OK, cool, so this is just the JSON data we were looking at over here.

Now, instead of just returning this, let's start to create some data. What we want to do is, for each one of these articles, create some things in the graph. We're going to unwind this, let's say UNWIND value AS article, so that article refers to each one of these objects. First, let's merge on the user. If you're not familiar with the Cypher MERGE command, it's like a get-or-create, like an upsert: it says check whether this pattern exists in the graph; if it doesn't exist, create it, and if it already exists, just match on it. This lets us write import scripts that are what's called idempotent, which is really important here because we're ultimately going to be running this import every hour, and we want it to be idempotent so that if the data hasn't changed, no data in the database changes. MERGE is really useful for that. So we're going to merge on the user; we'll alias the user to s, for submitter, which I guess is a good way to think of it, and in the merge pattern we want to merge on the property value that identifies uniqueness for the user, which in this case is the username, and that is article.submitter_user.username. I'm going to zoom out just a little bit — actually, let's move this up to the top, that'll give us a bit more room. There we go. OK: so, merge on the user.
Then we could say ON CREATE SET, so that only the first time we create the user do we set these values — but actually we have some things here that change; karma, for instance, can change. With MERGE we have two options: we can say MERGE and then SET some values, in which case every single time we execute the merge we update those property values, or we can say ON CREATE SET, which applies only the first time the node is created. I think some of these values can change — karma certainly changes, a user can update their description, they can change their avatar URL — so let's just SET all of them each time. So we'll say SET: created is the created-at cast to a datetime, datetime(article.submitter_user.created_at); karma, let's bring that in, is article.submitter_user.karma. Because we're using apoc.load.json here, we don't have to cast that integer to an integer; that's handled for us, which is quite nice. If we were using something like LOAD CSV we would need to cast any data type that's not a string. Here we're passing the created-at timestamp to the datetime() function so it's converted to a datetime type in the database. What else? We have the about, so the description, article.submitter_user.about, and the avatar URL, article.submitter_user.avatar_url — and let's prepend the Lobsters domain to that so we have the full URL, because you can see here it's just the path without the domain. OK, that's probably enough for the submitting user.

Next, let's merge on the user that invited this user. So MERGE again; this one is going to be i, for the inviter, merging on article.submitter_user.invited_by_user (I'm just looking at our document over here), and then we'll say this user INVITED_BY i — oh, I've got that backwards: the submitter was INVITED_BY i, the inviting user. OK, so that will create the relationship between the inviting user and the submitting user if it doesn't exist.

Now let's do another merge, on our article. What is the property value that identifies uniqueness for the article? I guess that's going to be the short ID, article.short_id. And let's go ahead and update these properties too — again, we could use ON CREATE SET, but these values can change over time for the article: the score goes up as it gets voted on, and sometimes the title can be changed, so let's just update them each time. So: the article URL, which we care about; what else, the score; the article title; and there's a URL for the comments, which is important because we might want to scrape the comment data later on. What else do we have? Oh, the created-at date, so we'll pass that to datetime() again, datetime(article.created_at). OK, so that creates the article node, and then a MERGE to say the submitter SUBMITTED the article, to create that relationship.

The last thing we have here are the tags. We're going to unwind over this tags array and do another merge to create the tags, or find them if they already exist, and connect the tags to the article. Let's add a WITH here just to bring through article, the JSON object, and a, the article node; then we UNWIND article.tags AS tag, merge on the tag, and then merge on the pattern that says the article HAS_TAG t. OK, let's give that a run and see if it works.
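Assembled into one statement, the import query being built up here looks roughly like this. It's a sketch: the node labels, relationship types, and JSON field names follow the model and the newest.json structure described above, but the exact property names may differ slightly from what was typed on stream.

```cypher
// Load the Lobsters "newest" feed and build the graph from it.
// MERGE everywhere keeps the statement idempotent, so re-running it
// on unchanged data changes nothing in the database.
CALL apoc.load.json("https://lobste.rs/newest.json") YIELD value
UNWIND value AS article

// The submitting user, keyed on username
MERGE (s:User {username: article.submitter_user.username})
SET s.created    = datetime(article.submitter_user.created_at),
    s.karma      = article.submitter_user.karma,
    s.about      = article.submitter_user.about,
    s.avatar_url = "https://lobste.rs" + article.submitter_user.avatar_url

// The user who invited the submitter
MERGE (i:User {username: article.submitter_user.invited_by_user})
MERGE (s)-[:INVITED_BY]->(i)

// The article, keyed on its Lobsters short_id
MERGE (a:Article {short_id: article.short_id})
SET a.url          = article.url,
    a.score        = article.score,
    a.title        = article.title,
    a.comments_url = article.comments_url,
    a.created      = datetime(article.created_at)
MERGE (s)-[:SUBMITTED]->(a)

// One Tag node per tag name, connected to the article
WITH article, a
UNWIND article.tags AS tag
MERGE (t:Tag {name: tag})
MERGE (a)-[:HAS_TAG]->(t)
```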
Running it: nope, a syntax error on line 15; we forgot a comma on the previous line in our SET. Let's try it again. OK, so we created 104 nodes, set a bunch of properties, and created a bunch of relationships. Let's match on all our articles and return them. Here's an article; if we double-click we can start to traverse from it. So this is an article, "Thoughts on Racket and Chez Scheme" — I don't know what that is, a Scheme, something to do with Lisp. OK, cool: this article has a tag, we can see other articles that have the same tag, so other things about Lisp, we can see the user that submitted the article, and who invited that user. Cool, so I think that is what we wanted to create in the graph.

We could go a couple of ways from here: we could play with something like Neo4j Bloom to start making sense of this visually, we could start analyzing some of this data, could we figure out recommendation algorithms — but I think we need a bit more data first. So what we want to do is set this up to scrape the data over time; we said every hour we want to import this data into Neo4j. All we've done so far is write this single Cypher statement that fetches this URL, this JSON file, and creates the data in the graph, but we want to run it periodically, and this is where our GitHub Action comes in. So we're going to go back to GitHub — and let's copy this Cypher statement, it will be valuable for us in a moment.

Now, where is our repo? Here it is. What I want to do now is take a look at the post-processing option for Flat Data. One of the options we saw is that we can run a JavaScript or Python file after our GitHub Action runs. We've set this up to run every hour to download the JSON data and check it into GitHub, and we could write a script that uses, say, the Neo4j JavaScript driver to run our Cypher statement: load the JSON file that's been checked into GitHub, pass that JSON as a parameter, connect to Neo4j, and execute the statement. We could write a script to do that, but I already wrote a GitHub Action that abstracts this away and makes it a bit easier, and it's in the GitHub Action Marketplace, so let's take a look at it. If we search the GitHub Action Marketplace — I called it flat-graph, because it's designed to work with Flat Data, the Flat Data GitHub Action. The way it works is that after we configure our GitHub Action using Flat Data to fetch the JSON file, we can then use this flat-graph action, and all we have to do is define some credentials for our Neo4j instance, the file name we want to pull in, and the Cypher query we want to run. What it does is load that JSON file, pass it as a parameter to the Cypher statement we define, connect to the Neo4j instance, and run the statement every time our GitHub Action runs, in our case once every hour. So it just takes away some of the boilerplate; again, we could write a script to do the same thing, but that's the purpose of the flat-graph GitHub Action. I'll drop a link to that in the chat as well.

In the repository settings we can add secrets and then reference them with the secrets syntax in our GitHub Actions later on, so we want to create a few secrets here: one for the Neo4j user, which is neo4j; one called NEO4J_URI, which is the connection string for our Aura instance, which is right here; and then one for our Aura password — which I will change later, so don't worry too much about it showing up on screen; that's fine, we'll change it after the stream. OK, cool, so now I've got NEO4J_PASSWORD, NEO4J_URI, and NEO4J_USER.
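With those secrets in place, the import step we're about to add to the workflow ends up looking roughly like the sketch below: the same Cypher statement as before, minus the apoc.load.json call, reading the parsed file contents from the $value parameter that flat-graph passes in. The input names (neo4j-user, neo4j-password, neo4j-uri, filename, cypher-query) are my reading of the flat-graph README, so double-check them against the action's documentation.

```yaml
      # Runs after the "Fetch newest" step, inside the same job
      - name: Neo4j import
        uses: johnymontana/flat-graph@v1.2
        with:
          neo4j-user: ${{ secrets.NEO4J_USER }}
          neo4j-password: ${{ secrets.NEO4J_PASSWORD }}
          neo4j-uri: ${{ secrets.NEO4J_URI }}
          filename: newest.json  # the file committed by the Flat Data step
          cypher-query: |
            UNWIND $value AS article
            MERGE (s:User {username: article.submitter_user.username})
            SET s.created    = datetime(article.submitter_user.created_at),
                s.karma      = article.submitter_user.karma,
                s.about      = article.submitter_user.about,
                s.avatar_url = "https://lobste.rs" + article.submitter_user.avatar_url
            MERGE (i:User {username: article.submitter_user.invited_by_user})
            MERGE (s)-[:INVITED_BY]->(i)
            MERGE (a:Article {short_id: article.short_id})
            SET a.url          = article.url,
                a.score        = article.score,
                a.title        = article.title,
                a.comments_url = article.comments_url,
                a.created      = datetime(article.created_at)
            MERGE (s)-[:SUBMITTED]->(a)
            WITH article, a
            UNWIND article.tags AS tag
            MERGE (t:Tag {name: tag})
            MERGE (a)-[:HAS_TAG]->(t)
```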
If we go back to our YAML file, let's edit it and add a new step. We need to give it a name; let's call it "Neo4j import". This is going to use johnymontana — that's my GitHub username — slash flat-graph, version 1.2, the flat-graph GitHub Action we were just looking at. We need to define some values to pass to it: a Neo4j user, which we'll read from secrets.NEO4J_USER; a password, which we saved in secrets.NEO4J_PASSWORD; and a Neo4j URI, our Aura connection string, which is in secrets.NEO4J_URI, I think we called it. We also need to specify the filename, which we said is newest.json, and we need our Cypher query. The cypher-query is a multi-line value; there are a few different ways to define multi-line strings in YAML and I think this is one of them. I need to go back and copy the query we wrote: we're going to use pretty much all of it, except we don't need the apoc.load.json call, because we're reading from the file that's being passed in as a parameter, so we'll need to modify it just a little bit. Here's our query; let's indent it properly so YAML parses it as one multi-line string. The only change we should need to make is that instead of value coming from apoc.load.json, where value is a variable, we're reading the newest.json file and passing it in as a Cypher parameter — that's what the flat-graph action is doing for us behind the scenes — so we just need to change the syntax to $value; the dollar sign says, hey, this is a Cypher parameter, some value that's going to be passed along with the query.

OK, before we save this: when we commit this file it will trigger the action to run and we'll start importing data into Neo4j, so let's delete everything from our database first so we can actually verify that it worked. I'll run MATCH (a) DETACH DELETE a to delete everything, and now there's nothing in the database. So I'm going to commit this change to our YAML file, and that should trigger our action to run. If we go to the Actions tab we can see that, yep, it is running, and what should happen is: it fetches newest.json, checks it in to the repo, and then runs the Neo4j import step, which should load the data into Neo4j. So now if we run our match statement again we should see some data — cool, looks like it worked. Here are the 25 most recent articles submitted to Lobsters, including the tags and the users, and we're also seeing who invited those users.
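As a quick spot check that each hourly run keeps landing data, a query along these lines (a sketch, using the property and relationship names from the import statement above) pulls back the most recently imported articles with their submitters and tags:

```cypher
// Most recently created Article nodes, with submitter and tags
MATCH (u:User)-[:SUBMITTED]->(a:Article)-[:HAS_TAG]->(t:Tag)
RETURN u.username, a.title, a.score, a.created, collect(t.name) AS tags
ORDER BY a.created DESC
LIMIT 25
```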
Cool, and this is now scheduled to run every hour: if we go back and look at our YAML file, yep, it will run every hour to update the data. So now, in the background, we've set up this scraping infrastructure to import all of the most recent articles from Lobsters and save them in Neo4j Aura, and once we have a bit more data we can start doing some interesting things, like analyzing the dataset and maybe generating some personalized recommendations. Let me drop a link to the repo we just created if you want to see what we did. And again, this is just an example using the Lobsters dataset; I've been using this approach in the last couple of weeks for a few different datasets I've been interested in. We can also do this for APIs: if we have some API that we subscribe to — I've been looking at the New York Times API, for example, which has API keys — we can add the API keys to our secrets and use them in GitHub Actions as well. I've also been looking at wildfire data, since that is particularly top of mind right now.

Cool, I think that is probably enough to cover for today. Next time we'll have a week's worth of data, so it should be a lot more interesting. I think what I'd like to dig into next is maybe some visualization aspects of this: can we build an interactive visualization to try to understand a bit more about this dataset and this network? That's what we'll do next time. I think we'll stick with the same Thursday afternoon, US time, schedule. I know it's been a bit of a hiatus for me on the livestream, but hopefully we can get back to the regular weekly schedule. Cool, well, that's all I have for today. Thanks so much for joining, have a good rest of your day, and we will see you next time. Cheers.