The deep web, also known as the invisible web or the hidden web, is the part of the World Wide Web whose content is not indexed by standard search engines. Its opposite is the surface web, which is accessible to anyone using the Internet. Computer scientist Michael Bergman coined the term deep web in 2001 as a search-indexing term.
Deep web content is hidden behind HTTP forms and includes many common applications such as web mail, online banking, and services that users must pay for, which are protected by paywalls: video-on-demand services and some online magazines and newspapers, for example.
Deep web content can be located by a direct URL or IP address, but full access may require a password or other access credentials to get past the public-facing pages of the website.
Deep Web Terminology
The terms deep web and dark web were first conflated in 2009, when deep web search terminology was discussed together with the illegal activities taking place on Freenet and the darknet.
Since the Silk Road marketplace entered the media spotlight, many people and media outlets have used the term deep web interchangeably with dark web or darknet. Some consider this usage inaccurate, and it has become a major source of confusion. Wired reporters Kim Zetter and Andy Greenberg have recommended that the terms be used in distinct senses: while the deep web is any site that cannot be accessed through ordinary search engines, the dark web is the portion of the deep web that has been deliberately hidden and is inaccessible through standard browsers and methods.
In his article on the deep web published in The Journal of Electronic Publishing, Bergman noted that Jill Ellsworth had used the term invisible web in 1994 to refer to websites that were not registered with any search engine. Bergman also cites a January 1996 article by Frank Garcia:
“It would be a site that’s possibly reasonably designed, but they didn’t bother to register it with any of the search engines. So, no one can find them! You’re hidden. I call that the invisible Web.”
Another early use of the term invisible web was by Bruce Mount and Matthew B. Koll of Personal Library Software, who used it in a 1996 press release describing the first deep web tool.
The first use of the specific term deep web, now generally accepted, occurred in Bergman’s 2001 study mentioned above.
Methods that prevent web pages from being indexed by regular search engines fall into one or more of the following categories:
- Contextual web: pages whose content varies for different access contexts (for example, ranges of client IP addresses or the sequence of previously visited pages)
- Dynamic content: dynamic pages, which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
- Limited access content: sites that limit access to their pages in a technical way (for example, using the Robots Exclusion Standard, CAPTCHAs, or no-cache directives), which prevents search engines from browsing them and creating cached copies.
- Non-HTML or non-textual content: textual content encoded in multimedia files (images or videos) or specific file formats that search engines do not handle.
- Private web: sites that require registration and login (password-protected resources)
- Software: some content is intentionally hidden from the regular Internet and is accessible only with special software such as Tor, I2P, or other darknet software. For example, Tor allows users to access websites anonymously through .onion addresses while hiding their IP address.
- Unlinked content: pages that are not linked to by any other page, which may prevent web crawlers from accessing the content. This content is referred to as pages without backlinks (also known as inlinks). Also, search engines do not always detect all the backlinks of searched web pages.
- Web archives: web archiving services such as the Wayback Machine allow users to see archived versions of web pages across time, including websites that have become inaccessible and are no longer indexed by search engines such as Google.
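The "limited access content" category above relies on mechanisms such as the Robots Exclusion Standard (robots.txt). As a minimal sketch, Python's standard-library `urllib.robotparser` can show how a crawler decides which paths a site has placed off limits; the rules and URLs here are hypothetical examples, not a real site's policy:

```python
from urllib import robotparser

# Hypothetical robots.txt a site might serve to keep crawlers away from
# part of its content (the "limited access" category described above).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks before fetching each URL.
print(rp.can_fetch("*", "https://example.com/public/page.html"))    # True
print(rp.can_fetch("*", "https://example.com/private/report.pdf"))  # False
```

Content excluded this way is never fetched by compliant crawlers, so it never enters a search index even though it is publicly reachable.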
Content on the Deep Web
Although it is not always possible to discover a web server's content directly so that it can be indexed, that content may still be reachable indirectly (for example, through computer vulnerabilities).
Search engines discover content using web crawlers that follow hyperlinks from page to page. This technique is ideal for finding content on the surface web but is usually ineffective for the deep web. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries, because the exact number of possible queries is unknown. It has been noted that this can be partially overcome by providing links to query results, but doing so could unintentionally inflate the popularity of a deep web site.
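The limitation described above can be illustrated with a toy crawler. In this sketch, a small in-memory dictionary stands in for HTTP fetches (all page URLs and contents are invented for illustration): the crawler follows `<a>` links only, so the dynamic results page behind the search form is never reached, exactly as happens on the real web.

```python
from html.parser import HTMLParser

# Tiny in-memory "web": URL -> HTML. The results page behind the search
# form exists but is never linked, so a link-following crawler misses it.
PAGES = {
    "/index.html": '<a href="/about.html">About</a>'
                   '<form action="/search"><input name="q"></form>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/search?q=deep+web": "<p>Dynamic results page</p>",  # unreachable
}

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(start):
    """Breadth of the surface web reachable by following <a> links."""
    seen, frontier = set(), [start]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(PAGES[url])
        frontier.extend(parser.links)  # follows links only, never forms
    return seen

print(crawl("/index.html"))  # the dynamic /search?... page is never visited
```

The crawler visits only the two linked pages; filling out the form (and thus generating `/search?...` URLs) is precisely the step deep-web crawling research tries to automate.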
DeepPeep, Intute, Deep Web Technologies, Scirus, and Ahmia.fr are among the search engines that have accessed the deep web. Intute ran out of funding and is now only a static archive. Scirus retired near the end of January 2013.
Researchers have been exploring ways to crawl the deep web automatically, including content that can be accessed only through special software such as Tor. In 2001, Sriram Raghavan and Hector Garcia-Molina (of the Stanford Computer Science Department) presented an architectural model for a hidden-web crawler that used key terms provided by users or collected from query interfaces to fill out and submit a query form and crawl deep web content. Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of UCLA created a hidden-web crawler that automatically generated meaningful queries to issue against search forms. Several form query languages (such as DEQUEL) have been proposed that, in addition to issuing a query, also allow the extraction of structured data from result pages. Another effort is DeepPeep, a project at the University of Utah sponsored by the National Science Foundation, which gathered hidden-web sources (web forms) in different domains using novel focused-crawling techniques.
Commercial search engines have also begun exploring alternative methods to crawl the deep web. The Sitemap protocol (first developed and introduced by Google in 2005) and OAI-PMH are mechanisms that allow search engines and other interested parties to discover deep web resources on particular web servers. Both mechanisms allow web servers to advertise the URLs that are accessible on them, thereby enabling the automatic discovery of resources that are not directly linked to the surface web.
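A sitemap is simply an XML file listing URLs the server wants crawlers to know about, even if nothing links to them. The following sketch builds such a file with Python's standard library; the listed URLs are hypothetical stand-ins for otherwise-unlinked deep web resources:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical deep web URLs a server advertises so crawlers can find
# them even though no surface-web page links to them.
urls = [
    "https://example.com/archive/2005/report.html",
    "https://example.com/records/item?id=42",
]

urlset = ET.Element("urlset", xmlns=NS)
for u in urls:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = u

sitemap = ET.tostring(urlset, encoding="unicode")
print(sitemap)
```

The resulting document would typically be served at `/sitemap.xml` and referenced from robots.txt, which is how crawlers locate it without any inbound links.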
Google's hidden-web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages to the Google search index. The surfaced results account for thousands of queries per second against deep web content. In this system, the pre-computation of submissions is done using three algorithms:
- selecting input values for text search inputs that accept keywords,
- identifying inputs that accept only values of a specific type (for example, dates), and
- selecting a small number of input combinations that generate URLs suitable for inclusion in the web search index.
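The three steps above can be sketched in miniature. This is not Google's implementation; it is an illustrative Python fragment with an invented form (a free-text keyword box plus a typed input that accepts years), showing how candidate values are combined into a bounded set of GET URLs:

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical HTML form: one keyword input and one typed (year) input.
# Candidate keyword values might be mined from the site's own pages.
form_action = "https://example.com/catalog"
candidate_values = {
    "q": ["history", "geology"],   # step 1: values for the keyword input
    "year": ["2001", "2005"],      # step 2: input restricted to a type
}

def precompute_urls(action, inputs, limit=3):
    """Enumerate input combinations, keeping only a small number (step 3)."""
    names = sorted(inputs)
    combos = product(*(inputs[n] for n in names))
    urls = [f"{action}?{urlencode(dict(zip(names, c)))}" for c in combos]
    return urls[:limit]

for url in precompute_urls(form_action, candidate_values):
    print(url)
```

In a real system, step 3 would rank combinations by how distinct and useful their result pages are, rather than truncating the list; the `limit` cutoff here just stands in for that selection.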
In 2008, to make it easier for users of Tor hidden services to access and search .onion addresses, Aaron Swartz designed Tor2web, a proxy application able to provide access to these services through common browsers. When using this proxy, deep web links appear as random strings of letters followed by the .onion suffix.
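The proxy scheme rests on a simple URL rewrite: appending a gateway domain to the .onion hostname so an ordinary browser resolves it to the proxy. The sketch below shows the general idea; the gateway domain and the .onion address are illustrative, not endorsements of any particular live gateway.

```python
from urllib.parse import urlparse, urlunparse

def tor2web_url(onion_url, gateway="onion.to"):
    """Rewrite a .onion URL to pass through a Tor2web-style gateway.

    The gateway domain here is illustrative; real gateways have used
    similar host-suffix schemes to proxy hidden services.
    """
    parts = urlparse(onion_url)
    if not parts.hostname or not parts.hostname.endswith(".onion"):
        raise ValueError("not a .onion URL")
    # "<host>.onion/path" becomes "<host>.onion.<gateway>/path"
    return urlunparse(parts._replace(netloc=f"{parts.hostname}.{gateway}"))

# A made-up hidden-service address, for illustration only.
print(tor2web_url("http://abcdefgh12345678.onion/page"))
# http://abcdefgh12345678.onion.onion.to/page
```

Note that such a proxy sacrifices the visitor's anonymity: the gateway operator sees both the user and the hidden service, which is why Tor itself remains the recommended way to reach .onion sites.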