
Web Scraping magnet links from a website with cheerio and curl

Abdessamad Ely
Software Engineer

Notice: Before we start, I just want to say that this article is purely for educational purposes; please use it responsibly and at your own risk.

Introduction

1337x.to is a popular torrent website with multiple categories: movies, music, and so on.

The goal of this article is to automate downloading the torrents listed on pages like trending music, top-100 movies, and so on.

To download a torrent from 1337x.to we need to go to the torrent details page, then click the magnet link, which will then be opened with your default torrent downloader.

Imagine you want to download 50 torrents listed on a page: you would have to visit each detail page separately, then download the files by clicking on the magnet link on each page.

I will focus on explaining the principles used to scrape the links from 1337x.to, so you will be able to reuse the same technologies and adapt them to your use case.

Setting up the project

In this article, we will use Node.js with TypeScript support; even though it's a small project, I like to use TypeScript just for fun.

Get your terminal ready, and change your current directory to the location you want to store your projects.

We will create a new directory with mkdir javascript_extract_torrents_magnet_links, then change to it with cd javascript_extract_torrents_magnet_links.

Let's initialize our git repository with git init, our npm package with npm init -y, and our tsconfig.json by installing TypeScript with npm install typescript --save-dev and then running npx tsc --init.

This is the tsconfig.json file after removing comments, and adding the "outDir": ".out" property.

tsconfig.json
{
  "compilerOptions": {
    "outDir": ".out",
    "target": "es2016",
    "module": "commonjs",
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "strict": true,
    "skipLibCheck": true
  }
}

When using Node.js with TypeScript it's helpful to install @types/node with npm install @types/node@14 --save-dev (replace 14 with the version that matches your Node.js installation).

Let's start by creating a new file at src/app.ts with a simple async function that logs hello world to the console.

The app function will be the main starting point of our application.

src/app.ts
async function app() {
  console.log("hello world");
}

app();

To execute it, we will add an npm script called fetch to compile and execute our app.ts file.

package.json
{
    ...
    "scripts": {
      "fetch": "tsc && node .out/app.js"
    },
    ...
}

Because we use git, we need to add a .gitignore file.

.gitignore
/.out
/node_modules

To test that everything is working as expected, we will run npm run fetch; we should see a hello world message.

It's time to save our changes to our git repository with git add . then git commit -m "initial commit".

As we said before, there are two types of pages we will scrape: a listing page that contains multiple links to detail pages, and the individual detail pages themselves.

Let's first install both dependencies with npm install cheerio node-libcurl and npm install @types/cheerio --save-dev.

The main idea behind web scraping is fetching the HTML content for a page, then extracting the required information in an automatic way.

The same principle applies here; this is the structure of the links on the "/popular-music" page.

/popular-music
<table class="...">
  <thead>
    ...
  </thead>
  <tbody>
    <tr>
      <td class="...">
        <a href="..." class="icon"><i class="flaticon-hd"></i></a>
        <a href="the-link-we-are-looking-for">Torrent title</a>
        <span class="..."><i class="..."></i>1</span>
      </td>
      ...
    </tr>
  </tbody>
</table>

To extract the links we're looking for, we will need a css selector to use with cheerio to query the exact anchor elements of our choice.

To make it readable and clean, we will create a new object, selectors, which will store all of our css selectors; this comes in handy in larger web scraping projects.

It's always easier to explain after the code is already written, so let's do just that and then explain what's going on.

src/app.ts
import * as cheerio from "cheerio";
import { curly } from "node-libcurl";

const config = {
  baseUrl: "https://1337x.to",
  selectors: {
    anchorLink:
      'table > tbody> tr > td:first-child > a:last-child[href^="/torrent/"]',
    magnetAnchorLink: 'ul>li>a[onclick="javascript: count(this);"]',
  },
};

async function get(path: string): Promise<string> {
  const result = await curly.get(`${config.baseUrl}${path}`);
  if (result.statusCode !== 200) {
    throw `error with status code: ${result.statusCode}`;
  }
  return result.data;
}

...

Because this project is not that big, we will keep all the code in a single file, but please feel free to move any helper function to a helpers.ts file, and wrappers for packages, like our get function, to a lib.ts file.

The important thing is to be comfortable with your file structure, so it's easy for you to modify or add new features.

As you can see, we have declared a new config object which stores our app configuration; the baseUrl is the homepage of the website we are scraping.

And the selectors config, which includes the css selectors we need to extract data from the scraped HTML content.

Then we created a wrapper function around the curly function from the node-libcurl npm package.

This way it's easy to use without unnecessary duplication, and it also gives us the possibility to easily replace the current package with another one without having to make changes all over the place.

The get function simply tries to fetch the HTML content from the page path by concatenating it with the baseUrl; it returns the HTML in case of success, otherwise it throws an error.
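
If you ever wanted to drop node-libcurl, here is a minimal sketch of the same get function built on the global fetch available in Node.js 18+ (this is an assumption about your runtime, and the snippet is not part of the article's code):

// Hypothetical alternative to the node-libcurl based get function,
// assuming Node.js 18+ where fetch is available globally.
async function getWithFetch(path: string): Promise<string> {
  const response = await fetch(`${config.baseUrl}${path}`);
  if (!response.ok) {
    throw `error with status code: ${response.status}`;
  }
  return response.text();
}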

src/app.ts
...

async function app(path: string) {
  const html = await get(path);
  const $ = cheerio.load(html);

  const anchors = Array.from($(config.selectors.anchorLink));
  const hrefs = anchors.map((a) => $(a).attr("href")) as string[];

  console.log({ hrefs });
}

app("/popular-music");

Inside our main app function, we accept the path of the page we want to scrape and use the get function to fetch the page's HTML content.

Then we create a DOM instance using cheerio.load, passing it the HTML content we got back from curl. cheerio.load returns something similar to jQuery, but on the server.
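
As a quick illustration of that jQuery-like API (this tiny snippet is just an aside, not part of the app code):

// Load a small HTML string and query it the way you would with jQuery.
const $example = cheerio.load('<h1 class="title">Hello</h1>');
console.log($example("h1.title").text()); // logs "Hello"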

Now it's time to extract some data, using our anchorLink selector we will select all the anchor elements related to the torrent detail pages.

This returns a cheerio.Cheerio object containing the elements that match our selector; combined with Array.from, we get a cheerio.Element[] array.

You can imagine a cheerio.Element as an HTMLAnchorElement: an object with the properties we need, like the href attribute in our case.

Using Array.prototype.map(), we map the cheerio.Element[] array to a string[] of href attribute values.
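
Note that attr("href") is typed as string | undefined, which is why the code casts the result with as string[]; if you prefer to avoid the cast, a small type guard filter would also do the job (my addition, not part of the original code):

// Keep only the anchors that actually have an href attribute.
const hrefs = anchors
  .map((a) => $(a).attr("href"))
  .filter((href): href is string => typeof href === "string");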

At this point, if you run the fetch script with npm run fetch, you should see an array of the href attribute values logged to your console.

Now that we have the list of relative links to all torrent detail pages, we can scrape each page and extract the torrent magnet link needed to download the torrent files.

Just after where we map elements to links, we will add the following.

src/app.ts
  ...
  const hrefs = anchors.map((a) => $(a).attr("href"));

  let fetched = 1;
  const magnets: string[] = [];
  for (const href of hrefs) {
    const html = await get(href);
    const $ = cheerio.load(html);

    const magnet = $(config.selectors.magnetAnchorLink).attr("href") as string;
    magnets.push(magnet);

    console.log("magnet fetched: ", fetched);
    fetched++;
  }

  console.log({ magnets });
  ...

We loop through the hrefs array from the previous step, and for each href we do something similar to what we did before.

We fetch the page HTML content using the get function, then we create a cheerio instance with the cheerio.load function.

Here, because we only have a single magnet link per page, we don't need to map anything; we just access the href attribute and push it to the magnets array.

We also declare the fetched variable, which starts at 1 and increments by 1 each time a page is scraped. We then console.log it to have some feedback.

At this point, if you run npm run fetch again, you should see an array of the magnet links we extracted from each torrent detail page.

The purpose of doing all of this is to automate downloading torrents by extracting the magnet links and using them in an automatic fashion.

In this section, we will have our app generate a simple HTML page with an anchor link to each magnet link.

The idea is to create a public/index.html file, populate it with the links we extracted, then serve it using the live-server package.

When this page loads, we will have it click on the links every 3 seconds; at first you will need to allow the page to open links (you will get a notice that links were blocked, so you need to allow all).

By default when you click a magnet link on your browser you get a popup asking you if you want to open it with your default torrent downloader or cancel.

What you want to do is check Always allow site_domain to open links ... and click Open torrent_app_name; this will allow our script to work as we intended.

The writePublicIndex function definition

Enough theory, let's write some code. We will use fs/promises to create our file and write content to it, via a new writePublicIndex function that takes the scraped magnet links.

Just after the get function add the following.

src/app.ts
// imports
import path from 'path'
import fs from 'fs/promises'

...

async function writePublicIndex(magnets: string[]) {
  const indexPath = path.resolve("./public/index.html");

  const html = ['<div>'];
  for (const magnet of magnets) {
    html.push(`
  <a href="${magnet}" target="_blank">Link</a>`);
  }
  html.push(`
  <script>
    window.onload = async (event) => {
      let executed = 1
      for (let anchor of Array.from(document.getElementsByTagName('a'))) {
        await new Promise((res, rej) => {
          setTimeout(() => {
            anchor.click()
            console.log('magnet executed: ', executed)
            res()
          }, 3000)
        })
        executed++
      }
    }
  </script>
</div>
  `);

  try {
    await fs.writeFile(indexPath, html.join(""), {
      encoding: "utf-8",
    });
  } catch (e: any) {
    console.log(`Error while writing the ${indexPath}: ${e.message}`);
  }
}

...

For the writePublicIndex function to work properly, we need to manually create the public directory; fs.writeFile will create the index.html file if it doesn't exist, but it won't create a missing directory.
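
If you'd rather not create the directory by hand, a one-line addition just after the indexPath declaration inside writePublicIndex can take care of it (an optional tweak, not part of the original code):

// Optional: make sure the public directory exists before writing to it.
await fs.mkdir(path.dirname(indexPath), { recursive: true });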

First, we declare and set the indexPath constant with the help of the path.resolve function, which gives us an absolute path from a relative one.

Then we declare an html array, which we use to construct our HTML page: first we push a <div> to it, then we loop over the magnet links and push an anchor tag for each one.

After that, we add a script that will run after the page has loaded, the script's purpose is to click on each link on the page with a delay of 3 seconds.

After that, we just close our </div> tag and try to write the content to the public/index.html file; if an error occurs, like the public directory not existing, we just log the error message to our terminal.

The writePublicIndex function usage

After creating the writePublicIndex function, it's time to make use of it in order to populate our public/index.html with the magnet links we scraped in the previous section.

Locate where we logged the magnets array to the console with console.log({ magnets }); and replace that line with await writePublicIndex(magnets);.

src/app.ts
...

async function writePublicIndex(magnets: string[]) {
  ...
}

async function app(path: string) {
  ...

  for (const href of hrefs) {
    ...
  }

  await writePublicIndex(magnets);
}

app("/cat/Games/1/");

Now, if we run our fetch script again, it should populate our public/index.html file with anchor tags for all the magnet links we scraped.

Making the torrent page path dynamic

This automation wouldn't be complete without making the page path dynamic; in our example we use a static path like "/cat/Games/1/" or "/popular-music".

But what about passing it as an argument from the terminal, and maybe having a default page path in case none was provided?

It's pretty straightforward: we will use the process.argv array to extract the first argument we provide to our app.

At the end of the src/app.ts file, let's replace app("/cat/Games/1/"); with the following.

src/app.ts
...

const pagePath = process.argv.length >= 3 ? process.argv[2] : "/popular-music";
app(pagePath);

The first argument we pass is the third element of process.argv, as the first two entries are reserved for the node binary path and the script file path.

So we check if we have at least 3 args; if so, we take the third one, otherwise we use "/popular-music" as the default page path.
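
For example, running node .out/app.js /cat/Music/1/ gives a process.argv that looks roughly like this (the exact paths depend on your machine):

// Illustrative contents of process.argv; paths will differ on your machine.
console.log(process.argv);
// [0] "/usr/local/bin/node"  -> the node binary
// [1] ".../.out/app.js"      -> the compiled script file
// [2] "/cat/Music/1/"        -> our page path argument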

Serving our page with live-server

Let's first install it as a dev dependency using npm install live-server --save-dev; this package will help us quickly serve our public directory, which serves the index.html file by default.

Because we're using TypeScript, we also need to install its type definitions with npm i --save-dev @types/live-server.

This way we will have a single npm script that does everything for us, from fetching and scraping the magnet links to serving them and opening them with our default torrent downloader.

After the installation is complete, we will create a new function servePublicDir which will do what we explained above.

Just before our app function, we will define our function as shown below.

src/app.ts
// imports
import liveServer from "live-server";

...

async function servePublicDir() {
  const params = {
    root: "public",
    open: false,
  };
  liveServer.start(params);
}

Nothing much to explain here, it's self-explanatory: we use the start method on the liveServer object we imported.

Now we just need to use it, after writePublicIndex has finished writing to our public/index.html file.

I also added some exception handling for our application, so we won't serve the page if the scraping wasn’t successful.

This is the app function content, after all the modifications we did.

src/app.ts
...

async function app(path: string) {
  try {
    const html = await get(path);
    const $ = cheerio.load(html);

    const anchors = Array.from($(config.selectors.anchorLink));
    const hrefs = anchors.map((a) => $(a).attr("href")) as string[];

    let fetched = 1;
    const magnets: string[] = [];
    for (const href of hrefs) {
      const html = await get(href);
      const $ = cheerio.load(html);

      const magnet = $(config.selectors.magnetAnchorLink).attr(
        "href"
      ) as string;
      magnets.push(magnet);

      console.log("magnet fetched: ", fetched);
      fetched++;
    }

    await writePublicIndex(magnets);
    servePublicDir();
  } catch (e: any) {
    console.error(e);
  }
}

const pagePath = process.argv.length >= 3 ? process.argv[2] : "/popular-music";
app(pagePath);

Now you can use it with something like npm run fetch /cat/Music/1/, and it should scrape all the torrent magnet links from the /cat/Music/1/ page.

Conclusion

In this article, you have learned how to use a combination of technologies such as cheerio and curl to create a web scraping application with Node.js and TypeScript.

We have used the popular torrent website 1337x as a practical example, but the same principles can be applied to other kinds of websites.

Source code is available at: https://github.com/codersteps/javascript_extract_torrents_magnet_links.

That's it for this one, I hope it was helpful, see you in the next one 😉.
