Read our blogs, tips and tutorials
Try our exercises or test your skills
Watch our tutorial videos or shorts
Take a self-paced course
Read our recent newsletters
License our courseware
Book expert consultancy
Buy our publications
Get help in using our site
547 attributed reviews in the last 3 years
Refreshingly small course sizes
Outstandingly good courseware
Whizzy online classrooms
Wise Owl trainers only (no freelancers)
Almost no cancellations
We have genuine integrity
We invoice after training
Review 30+ years of Wise Owl
View our top 100 clients
Search our website
We also send out useful tips in a monthly email newsletter ...
Extracting external links from websites using Power BI |
---|
Did you know that you could use Power BI as a tool to scrape websites, using the Html.Table M function? We didn't either until now! |
In this blog
Did you know that you can scrape websites using Power BI Desktop?
The start of a list of all of the external links in a couple of websites.
It's amazing what we learnt while writing our new Advanced Power BI (Data) course!
The example below uses some pages from a publicly available demo website, but it would be much more fun to get a list of the external sites linked to by websites controlled by your employer, your competitors or your friends!
The first thing to do is to create a list of websites to visit in some form. Here's the one we'll use:
If you're short of time, you could always download the above Excel workbook, allowing you to follow along.
Load this into Power BI:
The table as it appears in Power Query, after promoting the header row.
Note that you may need to confirm your credentials and privacy settings to load this data.
Now add a custom column in Power Query to get the HTML data for each website:
The tool to add a custom column.
You can use the Web.BrowserContents M function to return the HTML from each web page:
This will give the HTML for each website visited.
This should return something like this:
The HTML for each page - it may take a short while for this to appear, as Power Query is having to visit each site.
You can now create a new column to use the Html.Table function to parse this HTML to extract just the HTML a tags (ie the hyperlinks):
See below for more on the syntax of the HTML.Table function.
You can then expand the results:
Click on the icon as shown above to expand the table of results for each page.
For our example, here's the start of what this would give:
The start of the links for each of our 3 web pages.
The Html.Table function takes two main arguments:
Argument | What it contains | For our example |
---|---|---|
1 | The name for the column you'll create | "Page links" |
2 | A JQuery-style selector for what to show | "a" |
If you don't know JQuery you'll still be able to use and make sense of the following, but if you don't know HTML at all you may struggle!
Change the second argument to the following to show only those links whose href attribute begins with the letters http (ie the external links):
"a[href^=""http""]"
This should give the following for our example:
You'd normally expect to see a host of different external links against each web page, but these are demo sites and so not typical (the first page doesn't have any external links at all!).
You can download the Power BI report containing this here (I've replaced the Excel data source for this to be a table of data, so the report should work as it stands).
Some other pages relevant to the above blog include:
Kingsmoor House
Railway Street
GLOSSOP
SK13 2AA
Landmark Offices
99 Bishopsgate
LONDON
EC2M 3XD
Holiday Inn
25 Aytoun Street
MANCHESTER
M1 3AE
© Wise Owl Business Solutions Ltd 2024. All Rights Reserved.