Tutorial: How to scrape Instagram using Swift 👾

In this tutorial I’ll show you how incredibly easy it actually is to scrape user information from any Instagram profile using Swift, without a third-party library ✨

Watch the video tutorial on how to scrape instagram!

1. Define the url for the request
2. Create a task to fire the request
3. Determine the left and right side of your data
4. Extract the data
5. Final Notes

We are going to implement our scraper inside a playground, so go ahead and create a new playground 😊

1. Define the url for the request

First, let’s define a variable that holds the url of the website we are going to scrape and a variable that holds the username, then create a URL out of both:

import Foundation

let baseUrl = "http://www.instagram.com/"
let username = "martin_lasek"
let url = URL(string: baseUrl + username)!

I am force unwrapping here because URL(string:) only returns nil when the string can’t be parsed as a URL, for example because it contains illegal characters, and we can clearly see: ours doesn’t 🤓
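
If the username ever comes from user input instead of a hard coded string, the force unwrap gets risky. Here is a minimal sketch of how that could be handled, assuming a hypothetical helper called profileURL(for:baseUrl:) that is not part of the original tutorial. It percent-encodes the username and returns nil instead of crashing:

```swift
import Foundation

// Hypothetical helper (an assumption for this sketch, not from the tutorial):
// builds the profile URL and returns nil instead of crashing on bad input.
func profileURL(for username: String, baseUrl: String = "http://www.instagram.com/") -> URL? {
    // Percent-encode characters that are not allowed in a URL path, e.g. spaces.
    guard let encoded = username.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed) else {
        return nil
    }
    return URL(string: baseUrl + encoded)
}

if let url = profileURL(for: "martin_lasek") {
    print(url.absoluteString) // http://www.instagram.com/martin_lasek
} else {
    print("couldn't build a URL")
}
```

For a fixed username like ours the force unwrap is fine, this is only worth it once the input is not under your control.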

2. Create a task to fire the request

Next we will define a data task that fires a request to our url and prints the html as a string to our console:

import Foundation

let baseUrl = "http://www.instagram.com/"
let username = "martin_lasek"
let url = URL(string: baseUrl + username)!

let task = URLSession.shared.dataTask(with: url) { (data, response, error) in
    guard let data = data else {
        print("data was nil")
        return
    }
    guard let htmlString = String(data: data, encoding: .utf8) else {
        print("couldn't cast data into String")
        return
    }
    print(htmlString)
}
task.resume()

Hitting the play button at the bottom of the playground will run the code!

You can also execute your code by hitting shift + enter on the last line ✌🏻😊
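
The closure above only checks whether data is nil. If you want to know why a request failed, one possible approach is to also look at error and the HTTP status code. The following is only a sketch: ScrapeError and htmlString(data:response:error:) are hypothetical names made up for this example.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // needed on Linux for URLResponse and HTTPURLResponse
#endif

// Hypothetical error type for this sketch; not part of the original tutorial.
enum ScrapeError: Error {
    case transport(Error)
    case badStatus(Int)
    case noData
    case notUTF8
}

// Turns the three values a data task hands to its completion handler
// into either the html string or a descriptive error.
func htmlString(data: Data?, response: URLResponse?, error: Error?) -> Result<String, ScrapeError> {
    if let error = error {
        return .failure(.transport(error))
    }
    if let http = response as? HTTPURLResponse, !(200...299).contains(http.statusCode) {
        return .failure(.badStatus(http.statusCode))
    }
    guard let data = data else {
        return .failure(.noData)
    }
    guard let html = String(data: data, encoding: .utf8) else {
        return .failure(.notUTF8)
    }
    return .success(html)
}

// Possible usage inside the completion handler of the data task:
// switch htmlString(data: data, response: response, error: error) {
// case .success(let html): print(html)
// case .failure(let reason): print("request failed:", reason)
// }
```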

Now when we have a look at the html in our console, yeah, it is a loooot of html hahah. But try finding where the <body> starts and you’ll see that there is actually a JavaScript object holding all the data for the profile! What a gold mine!

HTML Response of an Instagram Profile

3. Determine the left and right side of your data

A cryptic headline. What on earth is meant by left and right side of your data? Well well well Watson. I’m glad you asked 😏

Let me throw in some pictures with captions 😊

[Screenshot: HTML Response of an Instagram Profile, labeled "The Data We Are Going To Grab"]

So data here is the piece of information that we want. But how do we grab it? We can’t just figure out at which index of htmlString the follower count starts and then read a fixed number of characters, because where does it end? 5 indexes later? What if a profile has only 157 followers? Now the count ends 3 indexes later. You see, that approach doesn’t really work out 🤔
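
To make the problem concrete, here is a small sketch of the naive fixed-length approach. The two snippets are made up and only mimic the shape of the profile data, and fixedSlice is a hypothetical helper for this example:

```swift
// Made-up snippets that only mimic the shape of the profile data.
let large = #"count":19093},"followed_by_viewer"#
let small = #"count":157},"followed_by_viewer"#

// Naive approach: skip the 7 characters of `count":` and take the next 5.
func fixedSlice(_ text: String) -> String {
    let start = text.index(text.startIndex, offsetBy: 7)
    let end = text.index(start, offsetBy: 5)
    return String(text[start..<end])
}

print(fixedSlice(large)) // 19093
print(fixedSlice(small)) // 157},
```

With the shorter count the slice swallows two characters that don’t belong to the data anymore.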

But we know that the string to the left of the count won’t change and is quite unique as well. That only holds as long as we don’t use a longer part of the left side than displayed in the screenshot above, because then we would start taking characters into account that aren’t keys but actual values, which are subject to change and would break our code.

So what we do is we look at the left side of “the data we want to grab” and decide how much of the left side we need to ensure it is unique but also won’t change when we request a different profile.

edge_followed_by":{"count":

No matter what you do, make sure your left side really ends right before the data we want to grab. Literally the next character after our left side has to be the first character of the data we want to grab.

Same goes for the right side. It really has to start right after the data we want:

},"followed_by_viewer

Bear with me, it will all make more sense when we code it 😊!

import Foundation

let baseUrl = "http://www.instagram.com/"
let username = "martin_lasek"
let url = URL(string: baseUrl + username)!

let task = URLSession.shared.dataTask(with: url) { (data, response, error) in
    guard let data = data else {
        print("data was nil")
        return
    }
    guard let htmlString = String(data: data, encoding: .utf8) else {
        print("couldn't cast data into String")
        return
    }
    // print(htmlString) // commenting this out now

    let leftSideString = """
    edge_followed_by":{"count":
    """
    let rightSideString = """
    },"followed_by_viewer
    """

}
task.resume()

We are using """ because our string includes double quotes " and I find """ easier to look at than escaping every " in our string 😊
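
For comparison, here is the left side written both ways. This is just an illustration, the names escaped and multiline are made up for it:

```swift
// The same left side string written twice: once with escaped quotes,
// once as a multi-line string literal.
let escaped = "edge_followed_by\":{\"count\":"
let multiline = """
edge_followed_by":{"count":
"""

// The line break before the closing """ is not part of the string.
print(escaped == multiline) // true
```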

4. Extract the data

Here comes the secret ingredient. The question you have asked yourself from the beginning. How are we going to extract the data? Regex? How? Tell me!

It’s not regex. It’s range. 🥳

Strings in Swift have a powerful function called range(of:) (it comes with Foundation) which allows us to get the range of a string within another string, for example:

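let name = "Link"
let message = "Link! Hey Listen!"
// range(of:) returns an optional, it is nil when name can't be found in message
let rangeOfName = message.range(of: name)!

// String indexes are not plain Ints, so we measure their distance from the start
let startIndex = message.distance(from: message.startIndex, to: rangeOfName.lowerBound)
let endIndex = message.distance(from: message.startIndex, to: rangeOfName.upperBound)
print(startIndex) // 0
print(endIndex) // 4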

Yes, the upperBound index is 4. No worries, this is perfectly fine and I will explain it in a bit more detail in the final notes. We’re not going to use indexes, this was just an example to better understand how ranges work 😊

NOTE:
lowerBound = correct index
upperBound = one index further

We are going to work with ranges because we can not only “get the range of a string within another string” but also “access a string from within another string using a range”. I am sure you know how to access a single character from within a string using an index, right? (In Swift that index has to be a String.Index, a plain Int like name[2] won’t compile.)

let character = name[name.index(name.startIndex, offsetBy: 2)] // this would access the "n" out of name

Well let me tell you something awesome: same applies to ranges 🙌🏻✨

let string = message[rangeOfName] // gives you "Link" out of message
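
Two details are worth knowing here: range(of:) gives you nil when nothing is found, and subscripting with a range gives you a Substring, not a String. A small self-contained sketch (the names slice and copy are just for illustration):

```swift
import Foundation

let message = "Link! Hey Listen!"

// range(of:) returns nil when the search string is missing, so unwrap it safely.
if let rangeOfName = message.range(of: "Link") {
    let slice = message[rangeOfName]   // Substring, a view into message
    let copy = String(slice)           // independent String
    print(copy) // Link
}

// Searching for something that is not there yields nil instead of a range.
print(message.range(of: "Zelda") == nil) // true
```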

Alrighty let’s get back to our mission! So we have defined the left side of the data we want to grab as well as the right side of it 🤓

We are going to get the range of both sides to then create a whole new range that defines where our data is located within the htmlString:

import Foundation

let baseUrl = "http://www.instagram.com/"
let username = "martin_lasek"
let url = URL(string: baseUrl + username)!

let task = URLSession.shared.dataTask(with: url) { (data, response, error) in
    guard let data = data else {
        print("data was nil")
        return
    }
    guard let htmlString = String(data: data, encoding: .utf8) else {
        print("couldn't cast data into String")
        return
    }
    print(htmlString)

    let leftSideString = """
    edge_followed_by":{"count":
    """
    let rightSideString = """
    },"followed_by_viewer
    """

    guard
        let leftSideRange = htmlString.range(of: leftSideString)
    else {
        print("couldn't find left range")
        return
    }
    guard
        let rightSideRange = htmlString.range(of: rightSideString)
    else {
        print("couldn't find right range")
        return
    }

    let rangeOfTheData = leftSideRange.upperBound..<rightSideRange.lowerBound
    let valueWeWantToGrab = htmlString[rangeOfTheData]
    print(valueWeWantToGrab) // prints the follower count: 19093
}
task.resume()

By using the left side and the right side of the data, it doesn’t matter “how many indexes long” the actual data is. It doesn’t matter if someone has only 157 followers (3 indexes long) or 19093 followers (5 indexes long), because the left side and right side strings themselves won’t change. Only their position within the htmlString might change, for example the start index of the right side. But we don’t really mind, because we say “give me the range of that right side string within htmlString no matter where it is” and with that we get the correct range (start/end index). From both ranges we can build our new range that defines where within htmlString the data is located 🔥
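
If you want to reuse the idea for other values, it could be wrapped into a small function. This is only a sketch under my own assumptions: value(in:between:and:) is a hypothetical helper, the sample strings are made up, and unlike the playground code it searches for the right side only behind the left side.

```swift
import Foundation

// Hypothetical helper, not from the original tutorial: returns the text between
// the first occurrence of `left` and the next occurrence of `right` after it.
func value(in text: String, between left: String, and right: String) -> String? {
    guard let leftRange = text.range(of: left) else { return nil }
    // Only search behind the left side, so an earlier match of `right` is ignored.
    guard let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) else {
        return nil
    }
    return String(text[leftRange.upperBound..<rightRange.lowerBound])
}

// Made-up sample snippets that only mimic the shape of the profile data.
let small = #"{"edge_followed_by":{"count":157},"followed_by_viewer":false}"#
let large = #"{"edge_followed_by":{"count":19093},"followed_by_viewer":false}"#

let left = #"edge_followed_by":{"count":"#
let right = #"},"followed_by_viewer"#

print(value(in: small, between: left, and: right) ?? "not found") // 157
print(value(in: large, between: left, and: right) ?? "not found") // 19093
```

The same two marker strings work for both samples, no matter how many digits the count has.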

Here’s a small visualization of the ranges of our left side and right side just in case it didn’t click completely for you 😊

//       `leftSideString`
//
// edge_followed_by":{"count":
// <------------------------->
// ↑                          ↑
// lowerBound                 upperBound
//
//       `rightSideString`
//
// },"followed_by_viewer
// <------------------->
// ↑                    ↑
// lowerBound           upperBound

5. Final Notes

Remember the example with the name and the message, where we got a range with the indexes 0 and 4 although name only has the indexes 0–3:

"Link" // String
0123 // Its indexes

The lower bound had the correct start index (0) but the upper bound had an index that was one further (4 instead of 3).

Now when we created our new range we used the upper bound (one index further) of the left side and the lower bound (correct start index) of the right side.

Remember, the left side string ends exactly before our data starts. Its upper bound is one index further and is therefore the correct start index of our data, which is how we need it! 😍

Also remember that the right side string starts exactly after our data, meaning one index after the data ends. The upper bound of our data has to be one index further than where the data actually ends, because that’s how upper bounds work. And since a lower bound describes the exact index where a string starts, the lower bound of the right side gives us the exact index where the right side starts. And where does the right side start? Exactly! One index after our data. That’s why we can use the lower bound of the right side as the upper bound of our data, which is how we need it! 😍
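
The “one index further” behaviour comes from half-open ranges written with ..<, which exclude their upper bound. A tiny sketch to play with (the names whole and offsetOfEnd are just for illustration):

```swift
let name = "Link"

// A half-open range lowerBound..<upperBound excludes its upper bound,
// so startIndex..<endIndex covers every character of the string.
let whole = name.startIndex..<name.endIndex
print(name[whole]) // Link

// endIndex is the position one past the last character ("k").
let offsetOfEnd = name.distance(from: name.startIndex, to: name.endIndex)
print(offsetOfEnd) // 4

// Number of characters = distance between lower and upper bound.
print(name.distance(from: whole.lowerBound, to: whole.upperBound)) // 4
```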

Go to the last line right after task.resume(), hit shift + enter, and you got it! You successfully implemented your first small Instagram scraper 🎉!

