This post discusses how to automate “scraping” of the Old School RuneScape Wiki website (referred to as the OSRS Wiki in the remainder of this post). In my last post, Scraping the Old School RuneScape (OSRS) Wiki, we covered the basic principles of the MediaWiki API and how it relates to the OSRS Wiki. We also covered the structure of API URLs and provided a list of useful examples specifically for the OSRS Wiki.
This post continues the journey of the OSRS Wiki API. This time, we are looking at automating extraction of information from the OSRS Wiki using simple Python programs.
As outlined, this post discusses extracting data from the OSRS Wiki. Since we are using their service and data/content, it is important to take two specific things into consideration:
You most likely do not want to extract data from the API manually, that is, by entering URLs in your browser and copying and pasting the results! That really goes against the principles of API usage, since APIs exist to allow tasks to be automated using computer programs.
To start with, I think it is important to give a very simple API query example implemented in a computer program. The following example is written in the Python programming language, as it is probably the simplest language in which to implement this task.
This will be a very brief section, and running the programs in this post requires a little programming experience. To start, you need to install the Python programming language. If you are a Linux user, it is likely that your Linux distribution already has Python. If you are on Windows, you will need to download Python from the official website.
After installation of Python, I would recommend installing the requests package. Although there are other options in the standard Python library (e.g., urllib), I find the requests package easier to use. You can install the package using the Python package manager pip. Use the following command:
If you have a different version of Python installed, you may need to specify a version-specific pip executable (e.g., pip3.7 for Python 3.7). If you are on Windows, just add .exe to the end of the pip command (e.g., pip3.7.exe). An easy way to find your pip version is to type pip and then press the Tab key twice to list the available executables. If a pip executable is not found using this method, pip might not be installed on your system, or its location might not be included in your PATH environment variable.
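If it is unclear which Python interpreter a given pip executable belongs to, Python itself can report whether the package is visible. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical. Running the installer as python -m pip install requests is another way to make sure pip matches the interpreter you intend to use.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running interpreter can import the named package."""
    return importlib.util.find_spec(package_name) is not None


# Show which interpreter is running and whether it can see requests.
print("Interpreter:", sys.executable)
print("requests available:", is_installed("requests"))
```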
The API query we are going to implement in our program requests the categories available on the OSRS Wiki (have a look at Scraping the Old School RuneScape (OSRS) Wiki Part 1 if you want further information about this topic). The API URL we are going to implement is listed below. Make sure to try it out in a web browser to get an idea of the results.
When you enter the URL into your browser, you can get an idea of how the returned data will be structured. Below is an example of the query being run in the Mozilla Firefox browser.
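As a rough illustration of how such a URL is put together, the sketch below splits a category-listing URL into its base address and its parameters using the standard library. The URL shown is an assumption based on the standard MediaWiki allcategories query and the 500-result limit discussed in this post; it may differ from the exact URL used in the original example.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed URL: the base API endpoint followed by the query parameters.
url = (
    "https://oldschool.runescape.wiki/api.php"
    "?action=query&list=allcategories&aclimit=500&format=json"
)

# Separate the base address from the query string.
parts = urlsplit(url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Turn the query string into a plain dictionary of parameter names and values.
parameters = dict(parse_qsl(parts.query))

print(base_url)
for name, value in parameters.items():
    print(f"{name} = {value}")
```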
We are now going to start writing our Python program. Open your preferred text editor or Python IDE and enter the following line:
This line imports the requests library. After the import, we can use this library in our program, for example, to request web resources.
Next, we will add a custom user agent. A user agent is a request header that contains unique characteristics of who or what is sending an HTTP request. For example, most web browsers send their own user agent. An example of a Mozilla Firefox user agent value is:
Check out the Mozilla Developer Network article on User Agents if you want more information. So what is the point of adding a custom user agent? It allows you to send metadata about yourself when making an API request. Adding a custom user agent is a very polite thing to do for the OSRS Wiki admins, as it lets them know who is accessing their API. If there is a problem with your API requests (e.g., requests so frequent that they increase server load and consume too much bandwidth), they have your contact information. I usually include my email address and an identifiable user agent string.
The code snippet below displays a Python dictionary that is populated with the User-Agent and From values. This dictionary object is later used when making the API request. In the following example, the dictionary variable is named custom_agent.
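A user agent dictionary along these lines might look like the following sketch. The user agent text and the email address are placeholders chosen for illustration and should be replaced with your own details.

```python
# Request headers that identify who is calling the API.
# Both values are placeholders: swap in your own project name and email.
custom_agent = {
    "User-Agent": "example-osrs-wiki-script",
    "From": "name@example.com",
}

print(custom_agent)
```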
The next step is to construct the parameters of the query. In the last post, I discussed adding different parameters which allow us to specify what we want to request, for example, the type of action to perform (e.g., a query) and the format of the returned data (e.g., JSON). Below is an example of a query action that requests all categories available on the OSRS Wiki. This information is also stored in a Python dictionary.
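A parameter dictionary for this kind of query could be sketched as follows. The variable name parameters is an assumption, as is the exact set of keys; action, list, aclimit, and format are standard MediaWiki API parameters, and 500 matches the number of categories per request discussed in this post.

```python
# Query parameters for listing wiki categories.
# The variable name "parameters" is an assumption made for this sketch.
parameters = {
    "action": "query",        # the type of action to perform
    "list": "allcategories",  # which list to query
    "aclimit": "500",         # how many categories to return per request
    "format": "json",         # the format of the returned data
}

print(parameters)
```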
We have now finished constructing all the requirements for making the API call. The final thing we need to do is actually query the API. The following line performs the actual API request.
The requests.get method is used because we want to perform an HTTP GET request: we want to get data from the API! Note how we specify the base URL of the OSRS Wiki API. The second thing we include in the method call is our custom user agent, where we set the HTTP headers value (headers) to our user agent dictionary variable named custom_agent. The final thing we include in the method call is the parameters value (params), which is set to the dictionary holding our query parameters.
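Before sending anything to the wiki, it can help to check what such a call would transmit. The sketch below uses requests.Request and its prepare() method to build the request without sending it, so no traffic reaches the server. The base URL, the placeholder contact details, and the name parameters are assumptions for illustration; custom_agent follows the naming used in this post.

```python
import requests

# Assumed base URL of the OSRS Wiki API.
API_URL = "https://oldschool.runescape.wiki/api.php"

# Placeholder identification headers.
custom_agent = {
    "User-Agent": "example-osrs-wiki-script",
    "From": "name@example.com",
}

# Assumed query parameters for listing categories.
parameters = {
    "action": "query",
    "list": "allcategories",
    "aclimit": "500",
    "format": "json",
}

# Build the GET request locally; prepare() does not contact the server.
prepared = requests.Request(
    "GET", API_URL, headers=custom_agent, params=parameters
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing the same three values to requests.get(API_URL, headers=custom_agent, params=parameters) would send this request for real.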
One final note. We save the returned value from the HTTP GET request in an object named result. Since we specified (in our parameters) that we wanted JSON returned, the result object, once decoded from JSON, will be a dictionary data type (a Python dictionary is pretty much the equivalent of a JSON object in Python). We can easily see the information returned from this API request by printing the result object.
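Once the response is a dictionary, individual category names can be read with ordinary key lookups. The sketch below uses a hand-written, trimmed sample with the same layout as an allcategories response in the default JSON format, where each name sits under the "*" key. The category names are placeholders, not real data from the wiki.

```python
import json

# Hand-written sample shaped like a trimmed allcategories response.
# The category names are placeholders, not real wiki data.
sample_text = """
{
  "query": {
    "allcategories": [
      {"*": "Example category A"},
      {"*": "Example category B"}
    ]
  }
}
"""

# Decoding JSON text gives ordinary Python dictionaries and lists.
result = json.loads(sample_text)

# Each entry is a small dictionary with the category name under "*".
names = [entry["*"] for entry in result["query"]["allcategories"]]
for name in names:
    print(name)
```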
Sometimes it can be hard to piece together code snippets, so the full Python program discussed in the previous section is provided below:
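Here is a minimal sketch of how the pieces described above might fit together. It is not necessarily identical to the original program: the base URL, the parameters name, the placeholder contact details, the timeout value, and the main function are all assumptions.

```python
import requests

# Assumed base URL of the OSRS Wiki API.
API_URL = "https://oldschool.runescape.wiki/api.php"

# Placeholder identification headers; replace with your own details.
custom_agent = {
    "User-Agent": "example-osrs-wiki-script",
    "From": "name@example.com",
}

# Assumed query parameters for listing categories.
parameters = {
    "action": "query",
    "list": "allcategories",
    "aclimit": "500",
    "format": "json",
}


def main():
    """Send one GET request and print the decoded JSON response."""
    response = requests.get(
        API_URL, headers=custom_agent, params=parameters, timeout=30
    )
    # Stop early on HTTP errors instead of trying to decode an error page.
    response.raise_for_status()
    # Decode the JSON body into a Python dictionary.
    result = response.json()
    print(result)
```

Calling main() sends a single request and prints the decoded dictionary. When saving this as a script, add a call to main() at the bottom of the file.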
Continuing on from where we left off in the last section… we are starting with the last example API URL. We discussed that wikis are usually organized into categories and each page is tagged with specific categories. This allows a specific page, or topic, to be associated with a variety of different categories. The URL we investigated listed the first 500 categories present on the OSRS Wiki, as documented below.
If you paste this URL into your web browser you will see a list of 500 wiki categories in JSON format. But how can you access the next 500? Because there are more than 500 categories on the OSRS Wiki, you will need to perform multiple queries if you want all of them and not just the first 500. A good method to continue the query from where it left off is to use a generator or list and loop API requests, with each request picking up from where the last one ended. Providing an example makes this premise much easier to understand. As an example, the query performed by the extract all categories URL (listed above) will have an entry which will look like the code snippet below:
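For illustration, the sketch below shows the general shape of such an entry as a Python dictionary and how it can be merged into the parameters of the next request. The accontinue value is a placeholder rather than real output from the wiki, and the names parameters and next_parameters are assumptions.

```python
# Illustrative "continue" entry; the accontinue value is a placeholder.
result = {
    "continue": {
        "accontinue": "Example_category",
        "continue": "-||",
    },
}

# Assumed parameters used for the first request.
parameters = {
    "action": "query",
    "list": "allcategories",
    "aclimit": "500",
    "format": "json",
}

# Merging the "continue" entry into the parameters tells the API
# where the next request should resume.
next_parameters = {**parameters, **result["continue"]}
print(next_parameters)
```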
As you can see, there is a continue entry which documents the next entry in the query. In this case, the allcategories query has an option called accontinue that can be used for the second API query call. It is similar to marking the page of a book you are reading: if you pick the book up at a later date, you know exactly what page you had gotten up to. It is difficult to manually continue a query, so it is recommended to write simple programs that use generators or lists to perform this task.
The listing below shows an example of a Python program that is specifically written to extract all of the OSRS Wiki categories without any user input.
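One way to write such a program is with a generator that keeps requesting pages until the response no longer contains a continue entry. This is a sketch under stated assumptions rather than the original program: the function name iter_categories, the base URL, the placeholder contact details, and the delay between requests are all hypothetical. The session argument is expected to behave like a requests.Session.

```python
import time

# Assumed base URL of the OSRS Wiki API.
API_URL = "https://oldschool.runescape.wiki/api.php"

# Placeholder identification headers; replace with your own details.
custom_agent = {
    "User-Agent": "example-osrs-wiki-script",
    "From": "name@example.com",
}


def iter_categories(session, delay=1.0):
    """Yield every category name, following "continue" entries between pages."""
    parameters = {
        "action": "query",
        "list": "allcategories",
        "aclimit": "500",
        "format": "json",
    }
    while True:
        response = session.get(
            API_URL, headers=custom_agent, params=parameters, timeout=30
        )
        response.raise_for_status()
        result = response.json()

        # Hand back the names from this page one at a time.
        for entry in result["query"]["allcategories"]:
            yield entry["*"]

        # No "continue" entry means the last page has been reached.
        if "continue" not in result:
            break

        # Resume the next request from where this one stopped.
        parameters.update(result["continue"])
        # Pause between requests to keep the load on the wiki low.
        time.sleep(delay)
```

To run it against the live wiki, pass in a requests.Session(), for example: for name in iter_categories(requests.Session()): print(name). The one-second pause is an arbitrary, polite default rather than a documented requirement.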
You can also check out some Python tools I have written to extract data from the OSRS Wiki for my OSRS Item Database project (osrsbox-db). They are available on my OSRSBox GitHub database repository. Until next time… happy scraping everyone!