Fetch Instagram profile photos without the API and without the __a=1 parameter
Until now, scraping posts from an Instagram account was very easy: all you had to do was add __a=1 as a query string and the JSON was ready to be read. That no longer works, but this little RegExp will save you time.
Recently, the Instagram team decided (probably after Cambridge Analytica?) to restrict their APIs, so the parameter is no longer available and now returns a 403 Forbidden error.
So, what to do now? Should I create an app on Instagram? Access tokens? OAuth?
Nothing like that. The JSON is still embedded in the public profile page of each user, so we just need to extract it with a simple RegExp.
Note: this works, of course, only for public profiles.
In the source of a profile page, we can find the window._sharedData variable, which is a big object containing, among other things, a lot of links; those links are the posts. The HTML looks like this:
<script type="text/javascript">window._sharedData = { ... };</script>
With a simple RegExp, we can extract its content and convert it into an object that our code can read.
Code
In this case, we'll use Axios as the HTTP library and Node.js (JavaScript) as the language, with the async/await feature.
const Axios = require('axios');

async function instagramPhotos() {
    const userInfoSource = await Axios.get('https://www.instagram.com/theraloss/');
}
Now that we have the source, we need to write the RegExp. Since window._sharedData is assigned an object written as JSON, we can then parse it with the JSON.parse function.
async function instagramPhotos() {
    const userInfoSource = await Axios.get('https://www.instagram.com/theraloss/');

    // userInfoSource.data contains the HTML from Axios
    const jsonObject = userInfoSource.data.match(/<script type="text\/javascript">window\._sharedData = (.*)<\/script>/)[1].slice(0, -1);
}
Here we do three things:
We use /<script type="text\/javascript">window\._sharedData = (.*)<\/script>/ as the RegExp to capture all the JSON;
We retrieve the second item ([1]) of the matches returned by the .match() JS function, since the first one is the whole match, which also contains our "delimiters";
We delete the last character of the captured string because, in the page source, it ends with ; and, of course, that would not be valid JSON.
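The three steps above can be wrapped in a small helper. This is only a sketch: the name extractSharedData and the sample HTML string are made up for illustration, and it assumes the page still embeds window._sharedData in the form shown earlier. Unlike the inline version, it returns null when the pattern is not found instead of throwing.

```javascript
// Hypothetical helper: pulls the window._sharedData JSON out of a page's HTML.
function extractSharedData(html) {
  const match = html.match(
    /<script type="text\/javascript">window\._sharedData = (.*)<\/script>/
  );

  // No match (for example, the markup is not what we expect)
  if (match === null) {
    return null;
  }

  // match[0] is the whole match, match[1] is the captured group.
  // Drop the trailing ";" so that what remains is valid JSON.
  const json = match[1].slice(0, -1);

  return JSON.parse(json);
}

// Made-up sample page, only to show the helper in action
const sampleHtml =
  '<html><body><script type="text/javascript">' +
  'window._sharedData = {"entry_data":{"ProfilePage":[]}};' +
  '</script></body></html>';

console.log(extractSharedData(sampleHtml));
console.log(extractSharedData('<html></html>'));
```

Returning null lets the caller decide what to do when the page does not contain the expected script tag, instead of relying on a TypeError from reading [1] on a failed match.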
Now that we have our JSON, we can easily read it.
async function instagramPhotos() {
    const userInfoSource = await Axios.get('https://www.instagram.com/theraloss/');

    // userInfoSource.data contains the HTML from Axios
    const jsonObject = userInfoSource.data.match(/<script type="text\/javascript">window\._sharedData = (.*)<\/script>/)[1].slice(0, -1);

    return JSON.parse(jsonObject);
}
That's it! Now you can access the most recent posts with the path userInfo.entry_data.ProfilePage[0].graphql.user.edge_owner_to_timeline_media.edges, where userInfo is the parsed object. If, for example, we want to look at the 10 most recent posts, keep only the images (excluding the videos) and store them in an array, we could write code like this.
const Axios = require('axios');

async function instagramPhotos() {
    // It will contain our photos' links
    const res = [];

    try {
        const userInfoSource = await Axios.get('https://www.instagram.com/theraloss/');

        // userInfoSource.data contains the HTML from Axios
        const jsonObject = userInfoSource.data.match(/<script type="text\/javascript">window\._sharedData = (.*)<\/script>/)[1].slice(0, -1);
        const userInfo = JSON.parse(jsonObject);

        // Retrieve only the first 10 results (slice() leaves the original array untouched)
        const mediaArray = userInfo.entry_data.ProfilePage[0].graphql.user.edge_owner_to_timeline_media.edges.slice(0, 10);

        for (const media of mediaArray) {
            const node = media.node;

            // Process only if it is an image
            if (node.__typename && node.__typename !== 'GraphImage') {
                continue;
            }

            // Push the thumbnail src in the array
            res.push(node.thumbnail_src);
        }
    } catch (e) {
        console.error('Unable to retrieve photos. Reason: ' + e.toString());
    }

    return res;
}
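As a possible refinement, the filtering step can be separated from the HTTP request, so it can be tried without hitting Instagram. The sketch below assumes the parsed object has the shape used above; the function name pickImageThumbnails, the limit parameter and the sample data are made up for illustration.

```javascript
// Hypothetical helper: given the parsed window._sharedData object, returns the
// thumbnail URLs of the images found among the first `limit` posts.
function pickImageThumbnails(userInfo, limit = 10) {
  const edges =
    userInfo.entry_data.ProfilePage[0].graphql.user
      .edge_owner_to_timeline_media.edges;

  return edges
    .slice(0, limit) // slice() does not modify the original array
    .map((edge) => edge.node)
    // Same rule as above: skip nodes whose __typename is set and is not an image
    .filter((node) => !node.__typename || node.__typename === 'GraphImage')
    .map((node) => node.thumbnail_src);
}

// Made-up data that mimics the structure described in this post
const fakeUserInfo = {
  entry_data: {
    ProfilePage: [{
      graphql: {
        user: {
          edge_owner_to_timeline_media: {
            edges: [
              { node: { __typename: 'GraphImage', thumbnail_src: 'https://example.com/a.jpg' } },
              { node: { __typename: 'GraphVideo', thumbnail_src: 'https://example.com/b.jpg' } },
              { node: { __typename: 'GraphImage', thumbnail_src: 'https://example.com/c.jpg' } },
            ],
          },
        },
      },
    }],
  },
};

console.log(pickImageThumbnails(fakeUserInfo));
```

With this split, instagramPhotos() would only need to download the page, parse the JSON and pass the result to the helper.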