As a bot developer, you have to be wary of Captchas, as they can stop your bots from working. Can they be avoided and solved programmatically? Yes. Continue reading this article to discover the best ways to bypass Captcha.
When a website interrupts regular Internet users and asks them to prove they are not robots by carrying out certain actions, most of them have no idea what is going on, and some get frustrated.
However, as a bot developer, you know you are the reason why this technique was introduced – it was as a result of the actions of your bots sending too many requests and accessing websites in automated ways.
If regular users are still being forced to solve Captcha, then you should know that your bot can’t escape it; you either learn how to prevent them from appearing or learn how to solve them when they appear.
As a bot developer, I have come to realize that it is best to even avoid them in the first place because some of them can be incredibly difficult to solve via automated means. I have had issues solving some Captcha manually – do you think I will get that done easily programmatically?
Even the best anti-Captcha systems with a huge team behind their development still find it difficult to solve some Captcha programmatically – they employ people to get that done and pay them. Because of this, our focus will be on preventing them from appearing in our bot.
What is Captcha?
Captcha is the acronym for the Completely Automated Public Turing test to tell Computers and Humans Apart. It is sometimes written in all caps as CAPTCHA. This is a type of challenge-response test developed to determine whether the user behind Internet traffic is human or machine (computer).
This technology was introduced into the Internet landscape in response to the actions of automation bots. These bots can be in any form – web scraper, crawler, spider, purchase bot, bulk account creation bot, and any other form of software that sends HTTP requests to web servers without using the official public API provided by the web server administrator.
These bots are known for sending too many requests to websites, which could either crash them or add to their running cost without being of benefit to the websites they access.
But this is not the only issue associated with bots. They can also be used to gain an unfair advantage when users must complete a task within a limited time and competition is high, as when buying limited-edition sneakers, tickets, and other high-demand items.
Bots also collect data from web pages without the permission of website owners. Because of these and many more, websites put technologies such as Captchas in place to discourage bot access.
Types of Captcha Used by Websites to Prevent Bots from Accessing Content
When people hear about Captcha, they think only of the “I’m not a robot” checkbox. However, websites use a good number of Captcha types to determine the true source of a request. It is important to know them so that you do not look elsewhere for the cause when you are actually dealing with a Captcha problem. I will discuss each type briefly.
- Image Captcha
Image Captcha is the most popular Captcha you will encounter on the Internet. It requires you to identify objects in images. Google’s reCaptcha is one of the most effective Captcha services of this kind, although it can frustrate even regular users. Image Captchas that contain letters are easier to deal with.
- Word/Math Captcha
This type of Captcha requires you to solve a word or math problem to pass. An example is a Captcha that asks you to solve “3+5”. It appears in many forms.
- Honeypots
These are not easy to discover because they are hidden from real users with CSS. Bots download the full page content, so they can still see them.
When a bot interacts with a honeypot, which could be a hidden form field or a link, it has effectively reported itself as a bot. You will have to inspect the CSS of each element and make sure you do not interact with any element whose display is set to none or whose visibility is set to hidden.
- Invisible Captcha
Invisible Captcha can’t be seen. It works in the background and tracks behavior to determine whether requests coming from certain IPs are bot-initiated. It is effective but not foolproof, as experienced developers can build bots that mimic regular users.
- Social Media Sign in
This type of Captcha requires you to sign in to your social media account. It is not popular, as web admins are aware that Internet users will hesitate to do this.
- Time-Tracking
This type of Captcha works in a simple way: it tracks how fast you carry out certain actions, such as filling a form, and can tell that a bot filled the form because of the speed at which bots operate.
Is My Bot Receiving Captcha?
If you suspect that your bot is being interrupted by Captcha, look at the response the web server returns. Does it have a Captcha inside it?
Sometimes, you will not even get a Captcha returned in the code; you might just get constant timeout errors while you can still visit the same page using your browser. You might also receive some form of 5xx error.
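To make that check concrete, here is a minimal sketch in Python. It only inspects a status code and a response body you have already fetched. The marker strings are my own assumptions and are deliberately broad; adjust them for the website you are dealing with. Timeouts are not covered here because they usually surface as exceptions in your HTTP client rather than as a response.

```python
# Assumed markers; real sites differ, so extend this tuple for your target.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha", "captcha")


def looks_blocked(status_code, body):
    """Return a short reason if the response looks like a block, else None."""
    # A 5xx answer on a page that loads fine in a browser is a warning sign.
    if 500 <= status_code < 600:
        return "server error"
    lowered = body.lower()
    for marker in CAPTCHA_MARKERS:
        if marker in lowered:
            return "captcha marker: " + marker
    return None
```

You would call looks_blocked(response.status_code, response.text) after every request and slow down or switch IP when it returns a reason.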
Techniques to Avoid Captcha
You are most likely to encounter Captcha when filling forms online or when sending too many requests, as bots typically do. Sometimes they appear without you having any idea what triggered them. As I stated earlier, it is better to avoid them than to solve them. Follow the techniques below to avoid triggering Captcha.
Use Rotating Proxies
The number one way to avoid triggering Captcha is to use rotating proxies. Rotating proxies hide your real IP address and send your requests through other IP addresses, changing the IP assigned to your requests either at time intervals or after every request. This makes it difficult for websites to identify a recognizable IP footprint in the requests you send.
You can buy rotating proxies from Luminati, Smartproxy, Stormproxies, and Soax.
To be on the safer side, you can make use of a proxy API, otherwise known as a web scraping API.
Proxy APIs do not just rotate IPs but can also solve Captcha if they appear.
Scraping API, ScrapingBee, and Crawlera are some of the best proxy APIs on the market.
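If you manage your own proxy list instead of relying on a provider that rotates for you, rotating after every request can be sketched like this. The proxy addresses are hypothetical placeholders from a documentation-only IP range, and the function name is my own.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with the ones your provider gives you.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

_pool = cycle(PROXIES)


def next_proxy_config():
    """Return the proxy mapping for the next request, rotating on every call."""
    proxy = next(_pool)
    return {"http": proxy, "https": proxy}
```

The returned dictionary has the shape the requests library accepts through its proxies argument, for example requests.get(url, proxies=next_proxy_config(), timeout=10).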
Rotate User-Agent and Take Note of Your Other Headers
It might interest you to know that websites allow a few bots they regard as good bots to access them, such as search engine spiders. Your bot is not one of the supported bots, and as such, you will have to hide your real identity by disguising your user-agent as that of a popular web browser or a supported bot.
Just changing the user agent won’t work all the time; you will need a handful of user agent strings to rotate. It is also important to check the headers sent by your browser and send them along in your bot too.
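A minimal sketch of this in Python follows. The user agent strings and the extra header values are examples I picked for illustration; copy current ones from your own browser's developer tools.

```python
import random

# Example User-Agent strings (assumed values); refresh them from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_headers():
    """Return browser-like headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        # Headers a browser normally sends alongside the User-Agent.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Call build_headers() once per request and pass the result as the headers of that request.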
Randomize the Time between Requests
Bots are predictable, repetitive, and super-fast, and websites can use that against your bot. To safeguard your bot against triggering Captcha, I advise you to randomize the timing between your requests.
It is also good practice to set a delay between your requests to avoid overwhelming websites. The point is not only to avoid Captcha but also to be polite to a website and avoid causing damage.
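In Python, a randomized delay can be sketched as below. The default bounds of 2 and 6 seconds are arbitrary assumptions; pick values that suit the website you are accessing.

```python
import random
import time


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a delay uniformly at random between the two bounds."""
    return random.uniform(min_seconds, max_seconds)


def polite_pause(min_seconds=2.0, max_seconds=6.0):
    """Sleep for a random interval and return how long the pause was."""
    delay = random_delay(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling polite_pause() between requests gives you both the delay and the irregular timing in one step.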
Avoid Honeypots
As stated earlier, some invisible elements can be introduced into a web page. These elements are not visible to users using browsers but visible to bots. By interacting with these elements, your bot is directly asking for attention.
It is important to check the CSS properties of every element you wish to interact with and make sure visibility is not set to hidden and display is not set to none. Only when these two properties give you the green light should you go ahead and interact with an element. Fortunately, not all websites make use of this, but for websites that do, you will have to be careful.
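Here is a rough sketch of that check using only the Python standard library. It is an assumption-heavy simplification: it looks only at inline style attributes and the hidden attribute of input fields, so it will miss elements hidden through a stylesheet or a class. For those cases you need a real browser; Selenium's is_displayed() on an element is one option. The function and class names are my own.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element from human visitors.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class InputSorter(HTMLParser):
    """Sort <input> names into visible ones and likely honeypots."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        style = attr.get("style") or ""
        # Hidden via the boolean "hidden" attribute or via inline CSS.
        hidden = "hidden" in attr or bool(HIDDEN_STYLE.search(style))
        (self.suspicious if hidden else self.visible).append(name)


def split_inputs(html):
    """Return (visible_names, suspicious_names) for the inputs in html."""
    sorter = InputSorter()
    sorter.feed(html)
    return sorter.visible, sorter.suspicious
```

Your bot would then fill only the fields in the first list and leave the second list untouched.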
Render All JS Codes
A large number of web scrapers do not render JavaScript. They just send requests, download the page in full, parse out the required data, and the cycle continues. However, even if you can access all the required data without rendering JavaScript, you will still need to render JS on some web pages to avoid triggering Captcha.
If you are faced with a website that triggers Captcha until certain JS code has run, you will need to find that code and execute it. This can be a lot of work. For this reason, I advise you to make use of browser automation tools such as Selenium.
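As a minimal sketch, this is roughly what loading a page with Selenium and headless Chrome looks like in Python. It assumes the selenium package and a Chrome browser are installed on your machine; the function name and the timeout value are my own choices.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load a page in headless Chrome and return the HTML after scripts run."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(wait_seconds)
        driver.get(url)  # the browser executes the page's JavaScript
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Starting a browser for every page is slow, so in a real bot you would likely keep one driver open and reuse it across requests.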
Avoid Using Direct Links
I must confess, I do use direct links until I am certain a website makes use of them to detect bots. Web administrators are aware that people do not just land on their pages; they are referred from other pages. If a good number of direct link requests are coming in, a website will become defensive and trigger Captcha.
It is advisable to first visit other pages that link to the page you want, or to set the Referer header so that the website thinks you were referred, rather than sending only direct link requests.
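Setting that header can be sketched as follows. Note that the HTTP header is spelled Referer. Falling back to the site's home page when no referring page is given is my own assumption; a page that actually links to your target is more convincing.

```python
from urllib.parse import urlsplit


def headers_with_referer(target_url, referer=None):
    """Build headers that name a referring page for target_url.

    If no referer is given, fall back to the home page of the same site.
    """
    if referer is None:
        parts = urlsplit(target_url)
        referer = f"{parts.scheme}://{parts.netloc}/"
    return {"Referer": referer}
```

Merge the result into the other headers your bot sends with the request.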
How to Bypass and Solve Captcha
Sometimes, no matter what you do, you cannot avoid them. For instance, some registration pages and other form-filling pages have reCaptcha just before the submit button, and you must solve it before you can submit the form.
In scenarios such as this, you can’t avoid them; you will have to solve them. Most likely, you won’t want to solve them manually and will want it done automatically. How then will you do it? There are two options available to you: use a proxy API or a Captcha solving service.
Use Proxy APIs
I stated above that the likes of Scraping API and ScrapingBee could help you avoid Captcha – this is because they also solve them in the background without you knowing.
If you know you are dealing with a website where you are bound to encounter Captcha, you can go ahead and use either Scraping API or ScrapingBee, as they can solve Captcha for you automatically. They are priced by successful requests and provide you with proxies too.
Use Captcha Solving Services
An alternative way to solve Captcha is to make use of a Captcha solving service. These services make use of Artificial Intelligence, Machine Learning, and a host of other technologies and techniques to solve Captcha.
I will advise you to go for paid Captcha services as they are more effective. Some of the best Captcha solving services include 2Captcha, DeathbyCaptcha, and Anti-Captcha.
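Services of this kind typically work in two steps: you submit the Captcha as a task, then poll until a solution is ready. The sketch below shows only that general flow and is not the API of any provider named above. The two callables, submit_task and fetch_result, are hypothetical wrappers you would write around your provider's documented endpoints, and the polling interval and attempt limit are assumptions.

```python
import time


def solve_captcha(submit_task, fetch_result, poll_interval=5.0, max_attempts=24):
    """Submit a Captcha task and poll until the service returns a solution.

    submit_task() returns a task id; fetch_result(task_id) returns the
    solution, or None while the task is still pending. Both are hypothetical
    wrappers around your provider's API.
    """
    task_id = submit_task()
    for _ in range(max_attempts):
        solution = fetch_result(task_id)
        if solution is not None:
            return solution
        time.sleep(poll_interval)  # wait before asking again
    raise TimeoutError("Captcha was not solved in time")
```

The attempt limit matters: without it, a task the service never solves would leave your bot polling forever.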
Bonus: Captcha Solvers for Browser Users
Even without making bots, you will most likely experience Captcha while surfing the web with your browser. This occurs when you refresh or carry out tasks too quickly. Sometimes, you will need to submit a lot of forms, and each of these forms has a Captcha attached. If you are in this situation, I recommend browser extensions that solve Captchas automatically.
AntiCaptcha Plugin
The AntiCaptcha plugin is provided by Anti-Captcha, one of the best Captcha solving service providers. This browser extension is available for Chrome, Firefox, and a host of other browsers.
With this extension, you can solve many types of Captchas, including ReCaptcha 2.0 and 3, FunCaptcha, image Captcha, hCaptcha, and Geetest, among others. The extension has been tested and has proven to work on a good number of websites, including Solve Media, FreeBitco.in, Omegle Chat, AliExpress, and even EA FIFA. While this extension works well, it is paid.
Rumola
Rumola is also one of the browser extensions you can use to solve Captcha. With Rumola, you will not have to worry about Captcha again, as it automatically solves Captcha as you load any page with Captcha on it.
This browser extension is only available as a Chrome extension. Non-Chrome users can make use of its bookmarklet. Rumola has been developed to be usable even by Internet users with visual impairment.
Conclusion
Make no mistake: you cannot ignore Captcha when developing an automation bot that accesses web services it is not allowed to access, as you will most likely encounter them.
Interestingly, with the right mindset and some techniques incorporated into the development of your bot, you can avoid triggering Captcha – these techniques have been discussed above. However, if you are in a situation where you must solve Captcha, then you can use either a Captcha solving service or proxy API to solve them.