How to parse proxy lists

What is a proxy? What are proxies for?

Proxy is a generic name for technologies in which another computer makes a request to a remote host on your behalf and then passes the result back to you. In other words, a proxy is an intermediary.

Proxies have various uses, among them anonymization (hiding your real IP address), bypassing restrictions (blocked websites, limits on the number of requests), and improving performance (for example, by caching transmitted data).

Since this article deals with public (free) proxies, I would not advise using them for anonymization: any of them may be a honeypot, or may be configured so that your real IP address is still passed along in the protocol headers.

In other words, public proxies are suitable for bypassing restrictions in non-critical situations, but for anything more serious you should use other options. I would also not recommend using proxies unnecessarily: we do not know why they were set up or what the proxy owner does with the transmitted data.
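To make the point about headers concrete, here is a minimal sketch of how a remote server could notice a proxy or your real address. The helper name findRevealingHeaders(), the header list and the sample values are my own assumptions for illustration; real proxies may add other headers or none at all.

```php
<?php

// Headers that proxies commonly add and that may reveal the client
// (an illustrative list, not an exhaustive one).
const REVEALING_HEADERS = ['x-forwarded-for', 'forwarded', 'via', 'x-real-ip', 'client-ip'];

/**
 * Hypothetical helper: returns the names of headers that either are
 * typical proxy headers or contain $realIp in their value.
 *
 * $headers maps header names to values, as seen by the remote server.
 */
function findRevealingHeaders(array $headers, string $realIp): array
{
    $found = [];
    foreach ($headers as $name => $value) {
        $isProxyHeader = in_array(strtolower($name), REVEALING_HEADERS, true);
        $containsIp = strpos($value, $realIp) !== false;
        if ($isProxyHeader || $containsIp) {
            $found[] = $name;
        }
    }
    return $found;
}

// Made-up example of what a server might receive through a "transparent" proxy.
$seen = [
    'Host' => 'example.com',
    'Via' => '1.1 proxy.example',
    'X-Forwarded-For' => '203.0.113.7',
];
$revealing = findRevealingHeaders($seen, '203.0.113.7');
print_r($revealing);
```

If a test request through a proxy comes back with any such headers, that proxy should not be relied on for hiding your address.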

Free proxies

On the Internet you can find many lists of free proxies. Entire specialized sites publish them, often grouped by:

  • country
  • proxy protocol
  • availability
  • connection speed
  • response time

Some sites with proxy lists are quite user friendly and even allow you to download lists as a text file or in a table. Some, on the contrary, use obfuscation techniques to make parsing lists more difficult.

In this post I will show you how to parse IP addresses with port numbers and save them to a text file, working on the command line with a small amount of code. I will also give an example of code that shows how to use these lists in your own program.

If you do not want to work with the command line but still want such lists in one column (convenient for importing into programs), I have created a special service where you can download proxy lists in one column without any extra data.

Proxy from hidemyna.me

I will review several sites that I found through Google: I simply picked the first suitable ones from the first and second pages of the search results. So I cannot say anything about the quality of the proxies. These are not carefully selected lists, just the first sites I came across. The first one is hidemyna.me.

Lists for various criteria are available at: https://hidemyna.me/ru/proxy-list/

I will write the code in PHP: it is convenient not only on the command line, but can also be uploaded to a web server. If you prefer Bash as a scripting language, I recommend learning the regular expressions of the grep command.

To collect proxies in IP:PORT format, create the file proxy_parser.php and copy the following into it (note that, as written, the script requests https://free-proxy-list.net/):

<?php

$link = 'https://free-proxy-list.net/';

$agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36';

$ch = curl_init($link);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response_data = curl_exec($ch);
if (curl_errno($ch) > 0) {
    die('curl error: ' . curl_error($ch));
}
curl_close($ch);

// Match an IP cell immediately followed by a port cell
preg_match_all('#<td>[0-9.]{5,}</td><td>[0-9]{2,}</td>#', $response_data, $rawlist);

$cleanedList = str_replace('</td><td>', ':', $rawlist[0]);
$cleanedList = str_replace('<td>', '', $cleanedList);
$cleanedList = str_replace('</td>', '', $cleanedList);

foreach ($cleanedList as $key => $value) {
    echo $value . PHP_EOL;
}

Run like this:

php proxy_parser.php

The script prints a list of 300 proxies in IP:PORT format, one per line.

Note that the same proxies remain in the $cleanedList array, in case you need to use them directly in a program (an example is given below).
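The script above only prints the list. As a minimal sketch of saving it to a text file, the extraction can be wrapped in a function and the result written with file_put_contents(). The function name extractProxies(), the sample markup and the file name proxies.txt are assumptions made for illustration, and the regular expression assumes that the IP and the port sit in adjacent table cells.

```php
<?php

/**
 * Extracts IP:PORT pairs from adjacent <td>IP</td><td>PORT</td> cells.
 * The table layout is an assumption about the page being parsed.
 */
function extractProxies(string $html): array
{
    preg_match_all('#<td>([0-9.]{7,})</td><td>([0-9]{2,5})</td>#', $html, $rows, PREG_SET_ORDER);
    $list = [];
    foreach ($rows as $row) {
        $list[] = $row[1] . ':' . $row[2];
    }
    // Remove duplicates while keeping the original order.
    return array_values(array_unique($list));
}

// Made-up sample markup standing in for a downloaded page.
$sampleHtml = '<tr><td>192.0.2.10</td><td>8080</td><td>US</td></tr>'
            . '<tr><td>198.51.100.23</td><td>3128</td><td>DE</td></tr>'
            . '<tr><td>192.0.2.10</td><td>8080</td><td>US</td></tr>';

$proxies = extractProxies($sampleHtml);

// One proxy per line; the file name is hypothetical.
$file = sys_get_temp_dir() . '/proxies.txt';
file_put_contents($file, implode(PHP_EOL, $proxies) . PHP_EOL);
echo 'Saved ' . count($proxies) . " proxies to $file" . PHP_EOL;
```

In a real run, $sampleHtml would be replaced by the page body returned by curl_exec().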

As I have already said, if you do not want to work with the command line, a current proxy list from hidemyna.me is always available on this page: https://suip.biz/?act=proxy1

Proxy from spys.one

The next site is spys.one. Its lists are grouped by various criteria. If you are interested in particular countries, edit the parsing code below to fit your needs (most likely, changing the page address is enough). I am interested in all proxies, from any country and with any characteristics, so I will parse the page http://spys.one/en/free-proxy-list/.

This site has two peculiarities. First, the port numbers in the source code are obfuscated, which makes them unreadable for most command line tools. Second, only 30 addresses are shown per page by default. A maximum of 500 can be displayed, but for that you need to send the corresponding POST request to the page.

To deal with the obfuscation, I will use PhantomJS, a scriptable headless browser: it executes the page's JavaScript, so the rendered text contains the real port numbers.

To get a list of 500 proxies, create a proxy_parser.js file with the following contents:

"use strict";
var page = require('webpage').create(),
        server = 'http://spys.one/en/free-proxy-list/',
        data = 'xpp=5&xf1=0&xf2=0&xf4=0&xf5=0';

page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log('Unable to post!');
    } else {
        console.log(page.plainText);
    }
    phantom.exit();
});

Run it like this:

phantomjs proxy_parser.js | grep -E -o '[0-9.]{7,}:[0-9]{2,}'

The output is a clean proxy list in IP:PORT format.

Again, if you do not want to work with the command line and/or install PhantomJS, you will always find a fresh proxy list obtained by the method described above on this page: https://suip.biz/?act=proxy2
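If you would rather do the filtering in PHP than in grep, the same pattern works with preg_match_all(). This is only a sketch: the function name filterProxies() and the sample text are invented for illustration and merely imitate the kind of plain text that PhantomJS prints.

```php
<?php

/**
 * Pulls IP:PORT pairs out of arbitrary text, mirroring
 * grep -E -o '[0-9.]{7,}:[0-9]{2,}'.
 */
function filterProxies(string $text): array
{
    preg_match_all('/[0-9.]{7,}:[0-9]{2,}/', $text, $matches);
    return $matches[0];
}

// Made-up text in the spirit of the page.plainText output.
$plainText = "1 192.0.2.15:3128 HTTP NOA US\n"
           . "2 198.51.100.4:8080 SOCKS5 HIA DE\n"
           . "Proxy list updated hourly\n";

foreach (filterProxies($plainText) as $proxy) {
    echo $proxy . PHP_EOL;
}
```

To use it on real data, the PhantomJS output could first be redirected to a file and then read with file_get_contents().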

Proxy from gatherproxy.com

On the site http://www.gatherproxy.com/ we again encounter obfuscation – port numbers in hexadecimal format. The problem is solved with the PHP hexdec function.
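As a quick illustration of the conversion, here is a minimal sketch run on a single made-up record. The record only imitates the shape of the data embedded in the page, and its values are invented.

```php
<?php

// Made-up sample in the same shape as the data embedded in the page.
$sample = '{"PROXY_IP":"192.0.2.44","PROXY_LAST_UPDATE":"12 30","PROXY_PORT":"1F90"}';

// Capture the IP address and the hexadecimal port.
preg_match('#"PROXY_IP":"([0-9.]+)".*?"PROXY_PORT":"([0-9A-Fa-f]+)"#', $sample, $m);

// hexdec() converts a hexadecimal string to a decimal number: 1F90 -> 8080.
$proxy = $m[1] . ':' . hexdec($m[2]);
echo $proxy . PHP_EOL; // prints 192.0.2.44:8080
```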

Create a proxy_parser2.php file and copy into it:

<?php

$link = 'http://www.gatherproxy.com/';

$agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36';

$ch = curl_init($link);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response_data = curl_exec($ch);
if (curl_errno($ch) > 0) {
    die('curl error: ' . curl_error($ch));
}
curl_close($ch);

preg_match_all('#"PROXY_IP":"([0-9.]+)","PROXY_LAST_UPDATE":"[0-9. ]+","PROXY_PORT":"([A-Za-z0-9.]+)"#', $response_data, $rawlist);

foreach ($rawlist[1] as $key => $value) {
    echo $value . ":" . hexdec($rawlist[2][$key]) . PHP_EOL;
}

Run it like this:

php proxy_parser2.php

Again, if you are on Windows, you can find a ready-made list here: https://suip.biz/?act=proxy3

Proxy sites that allow downloading lists

On some sites the proxy lists are already available in a convenient format without any extra data. An example of such a site:

How to use a proxy in your program (PHP)

The following sample PHP code parses the proxy list and then runs a program using the values obtained. If no result is received (for whatever reason: the proxy does not work, or the remote host has banned the proxy's address), the code moves on to the next IP:PORT pair and repeats the request to the remote host. If that fails too, the process repeats until the desired result is obtained.

If all proxy addresses are exhausted, a new list is parsed.

Code:

<?php

//Set the proxy list counter to 0
$proxy_counter = 0;

//The function of parsing the proxy list
function getProxy() {
    global $proxy_counter;

    $link = 'https://free-proxy-list.net/';

    $agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36';

    $ch = curl_init($link);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $response_data = curl_exec($ch);
    if (curl_errno($ch) > 0) {
        die('curl error: ' . curl_error($ch));
    }
    curl_close($ch);

    //Match an IP cell immediately followed by a port cell
    preg_match_all('#<td>[0-9.]{5,}</td><td>[0-9]{2,}</td>#', $response_data, $rawlist);

    $cleanedList = str_replace('</td><td>', ':', $rawlist[0]);
    $cleanedList = str_replace('<td>', '', $cleanedList);
    $cleanedList = str_replace('</td>', '', $cleanedList);

    //Reset the counter if this is not the first proxy list
    $proxy_counter = 0;
    //We report that the list is compiled
    echo 'Compiled a new proxy list' . PHP_EOL . PHP_EOL;
    //Return the parsed list
    return $cleanedList;
}

//Get the initial list of 300 proxies
$proxy = getProxy();

//A function that calls some program through a proxy
function doIt($url) {
    // These variables must be within the scope of the function.
    global $proxy;
    global $proxy_counter;

    //Some command that uses a proxy - in this case cURL is taken as an example with the --proxy option
    //In this case, $proxy[$proxy_counter] at the initial stage corresponds to $proxy[0],
    // that is, the first pair of IP:PORT
    $command = "curl --proxy $proxy[$proxy_counter] $url";
    
    //We send a command to execute to the system 
    //The output of the function is saved to the $output array.
    exec($command, $output);
    
    //If a result was obtained and the output is not empty, this section is skipped.
    if (empty($output[0])) {
        do {
            //No result was received, so we get here
            //Go to the next proxy in the list
            $proxy_counter++;
            //The list is finite (300 proxies in this case), so once it is exhausted
            // we build a new one; getProxy() also resets the counter to 0
            if ($proxy_counter >= count($proxy)) {
                $proxy = getProxy();
            }
            //Rebuild the command with the new proxy
            $command = "curl --proxy $proxy[$proxy_counter] $url";
            //Run it
            exec($command, $output);
        //exec() appends the output lines of each command to the end of $output.
        //We check whether the very last element of the array,
        // which holds the result of the last command call, is empty.
        //If it is, we go around the loop again.
        } while (empty(end($output)));
    }
    //As soon as the result of the program is not empty,
    //this function returns it and terminates.
    //end() takes its argument by reference, so it is given $output itself
    //rather than the result of array_values()
    return end($output);
}

//Function call
doIt('https://site.ru');

The logic of the program is to keep using a working proxy until it is banned by the remote host. I also experimented with making every new request through a new proxy (to delay blocking by IP), but for my purposes the results turned out to be much worse. However, if you have a high-quality list of proxies, most of which are not banned on the remote host, you can make each new request through a new proxy. To do this, immediately after the first call:

exec($command, $output);

Add lines:

$proxy_counter++;
if ($proxy_counter >= count($proxy)) {
    $proxy = getProxy();
}

That is, there will be a change of proxy even if the request was successful.
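The retry logic of this section can also be expressed without global variables, which makes it easier to test. The following is only a sketch under my own assumptions: the function name requestWithRotation() and the callback signature are invented for illustration, and the callback stands in for a real network call such as the curl command above.

```php
<?php

/**
 * Tries proxies in order, starting at $start, until $request returns a
 * non-empty result. Returns [result, index of the proxy that worked],
 * or [null, null] if every proxy failed.
 *
 * $request is a hypothetical callable: function (string $proxy): string
 */
function requestWithRotation(array $proxies, callable $request, int $start = 0): array
{
    for ($i = $start; $i < count($proxies); $i++) {
        $result = $request($proxies[$i]);
        if ($result !== '' && $result !== null && $result !== false) {
            return [$result, $i];
        }
    }
    return [null, null];
}

// Stand-in for a real network call: only the third proxy "works".
$fakeRequest = function (string $proxy): string {
    return $proxy === '203.0.113.9:8080' ? '<html>ok</html>' : '';
};

$proxies = ['192.0.2.1:3128', '198.51.100.2:80', '203.0.113.9:8080'];
[$body, $index] = requestWithRotation($proxies, $fakeRequest);
echo "Proxy #$index returned: $body" . PHP_EOL;
```

Passing the returned index as $start on the next call keeps using the working proxy, while passing the index plus one switches to a new proxy for every request. When [null, null] comes back, the caller can fetch a fresh list and start again from 0.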

Conclusion

That is probably enough about proxies for now. If you would like to share other sites with good proxy lists, write them in the comments. If you want me to look at how to parse them and add the cleaned lists to SuIP.biz, mention that as well.

By the way, we are not finished with proxies yet: in one of the following articles I will show how to turn virtual hosting into your own proxy server.
