Monday, April 1, 2019

Accessing The Deep Web Computer Science Essay

The World Wide Web has grown from a few thousand web pages in 1993 to almost 2 billion web pages at present. It is a great source of information sharing. This information is available in varied forms: text, images, audio, video, tables, etc. People access this information via web browsers; a web browser is an application for browsing the web on the internet. Search engines are used to retrieve specific data from this pool of heterogeneous information [1]. In the rest of this chapter I will describe how people can find relevant information, how a search engine works, what a crawler is and how it works, and what the related literature says about this particular problem.

SEARCH ENGINE

A search engine is a program for searching for information on the internet. The results for a search query given by a user are presented as a list on a web page. Each result is a link to some web page that contains the specific information matching the given query. The information can be a web page, an audio or video file, or a multimedia document. Web search engines work by storing information in a database. This information is collected by crawling each link on a given web site. Google is considered the most powerful and heavily used search engine these days. It is a large-scale general-purpose search engine which can crawl and index millions of web pages every day [7]. It provides a good starting point for information retrieval, but may be insufficient for complex information needs that require extra knowledge.

WEB CRAWLER

A web crawler is a computer program which is used to browse the World Wide Web in an automatic and systematic manner.
It browses the web and saves the visited data in a database for future use. Search engines use crawlers to crawl and index the web in order to make information retrieval easy and efficient [4]. A conventional web crawler can only reach the surface web; crawling and indexing the hidden or deep web requires extra effort. The surface web is the portion of the web which can be indexed by a conventional search engine [11]. The deep or hidden web is the portion of the web which cannot be crawled and indexed by a conventional search engine [10].

DEEP WEB AND DIFFERENT APPROACHES TO DISCOVER IT

The deep web is the part of the web which is not part of the surface web and lies behind HTML forms or the dynamic web [10]. Deep web content can be classified into the following forms:

Dynamic Content: content which is accessed by submitting some input values in a form. Such content requires domain knowledge, and without that knowledge navigating it is very hard.

Unlinked Content: pages which are not linked from any other pages, which may prevent a search engine from crawling them.

Private Web: sites which require registration and login information.

Contextual Web: web pages whose content varies for different access contexts.

Limited Access Content: sites which restrict access to their pages.

Scripted Content: the portion of the web which is only accessible through links produced by JavaScript, as well as content dynamically loaded by AJAX functions.

Non-HTML/Text Content: textual content encoded in images or multimedia files, which cannot be handled by search engines. [6]

All of these cause a problem for search engines and for the public, because a great deal of information is invisible, and an ordinary search engine user does not even know that perhaps the most important information is inaccessible to him/her because of the above properties of web applications.
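Since HTML forms are the usual entry point to dynamic deep web content, a crawler first has to locate them on a page. The following is a minimal sketch using only Python's standard library; the sample HTML, action URL and field names are invented for illustration:

```python
# Sketch: detecting HTML forms on a fetched page, since forms are the
# usual entry point to dynamic (deep web) content. Standard library only;
# the sample page below is hypothetical.
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Collects the action URL and input-field names of every <form>."""
    def __init__(self):
        super().__init__()
        self.forms = []          # list of (action, [field names])
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = (attrs.get("action", ""), [])
            self.forms.append(self._current)
        elif tag in ("input", "select") and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current[1].append(name)

sample = """
<html><body>
  <form action="/search-cars">
    <input name="make"><input name="model"><select name="year"></select>
  </form>
</body></html>
"""
finder = FormFinder()
finder.feed(sample)
print(finder.forms)   # [('/search-cars', ['make', 'model', 'year'])]
```

A real crawler would run a finder like this over every fetched page and queue each discovered form for the form-filling stage discussed later in the chapter.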
The deep web is also believed to be a big source of structured data on the web, and retrieving it is a big challenge for the data management community. In fact, the claim that the deep web consists only of structured data is a myth: the deep web is a significant source of data, much of which is structured, but not all of it [8]. Researchers have been trying to find ways to crawl deep web content, and they have succeeded in this regard, but many research problems remain open. One way to search deep web content is a domain-specific or vertical search engine such as worldwidescience.org and science.org. These search tools provide links to national and international scientific databases and portals [7]. In the literature there are two other techniques for crawling deep web content: virtual integration and surfacing. Virtual integration is used in vertical search engines for specific domains like cars, books, research work, etc. In this technique a mediator form is created for each domain, together with semantic mappings between the individual data sources and the mediator form. This technique is not suitable for a general-purpose search engine: creating mediator forms and mappings is very costly; identifying the queries relevant to each domain is a big challenge; and, finally, the web covers almost everything, so domain boundaries cannot be clearly defined. Surfacing instead pre-computes the most relevant input values for all interesting HTML forms. The URLs resulting from these form submissions are generated offline and indexed like normal URLs. When a user queries for a web page which is in fact deep web content, the search engine has already filled the form and can present the link to the user. Google uses this technique to crawl deep web content. This technique is, however, unable to surface scripted content [5].
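The surfacing idea described above can be sketched in a few lines: given a form's action URL and a set of candidate values per field, materialise one GET URL per value combination so a conventional indexer can fetch them offline. The field names and value lists here are hypothetical stand-ins for the relevance-ranked values a real system would compute:

```python
# Sketch of "surfacing": pre-compute promising form inputs and generate
# the resulting GET URLs offline so they can be indexed like normal URLs.
from itertools import product
from urllib.parse import urlencode

def surface_urls(action, candidate_values):
    """Yield one URL per combination of candidate form-input values."""
    fields = sorted(candidate_values)
    for combo in product(*(candidate_values[f] for f in fields)):
        yield action + "?" + urlencode(dict(zip(fields, combo)))

urls = list(surface_urls(
    "http://example.com/cars",
    {"make": ["ford", "honda"], "year": ["2009", "2010"]},
))
for u in urls:
    print(u)   # 4 URLs, e.g. http://example.com/cars?make=ford&year=2009
```

The hard part in practice is not URL generation but choosing candidate values that return non-empty results, which is where the cost of this technique lies.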
Today most web applications are AJAX-based, because AJAX reduces the user's surfing effort and the network traffic [12, 14]. Gmail, Yahoo Mail, Hotmail and Google Maps are famous AJAX applications. The major goal of AJAX-based applications is to improve the user experience by running client code in the browser instead of refreshing the whole page from the server. The second goal is to reduce network traffic, which is achieved by refreshing only a part of the page from the server [14]. AJAX has its own limitations. AJAX applications refresh their content without changing the URL, which is a problem for crawlers because they are unable to identify a new state; the application looks like a single-page web site. So it is essential to find some mechanism to make AJAX crawlable. In surfacing web content that is only accessible through JavaScript, as well as content behind URLs dynamically downloaded from the web server via AJAX functions [5], there are several hurdles that keep this content hidden from crawlers:

Search engines pre-cache a web site and crawl it locally, but AJAX applications are event-based, so events cannot be cached.

Because AJAX applications are event-based, several events may lead to the same state, since the same key JavaScript function is used to provide the content. It is necessary to identify redundant states in order to optimize the crawling results [14].

The entry point to the deep web is a form. When a crawler finds a form, it needs to guess the data with which to fill it out [15, 16]. In this situation the crawler needs to act like a human.

There are many solutions to these problems, but all have their limitations. Some application developers provide a custom search engine, or expose web content to a traditional search engine based on an agreement. This is a manual solution and requires extra contribution from the application developers [9]. Some web developers provide a vertical search engine on their web site which is used to search for information specific to that site.
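One common way to detect the redundant states mentioned above is to hash a normalised snapshot of the DOM after each event and skip any state whose hash has already been seen. This is a simplified sketch of that heuristic, with short strings standing in for real serialised DOM trees:

```python
# Sketch: recognising redundant AJAX states by fingerprinting DOM
# snapshots. Whitespace is collapsed so cosmetic differences do not
# produce spurious "new" states. Snapshots here are illustrative.
import hashlib
import re

def state_fingerprint(dom_snapshot):
    """Return a stable hash of a DOM snapshot, ignoring extra whitespace."""
    normalised = re.sub(r"\s+", " ", dom_snapshot.strip())
    return hashlib.sha256(normalised.encode()).hexdigest()

seen = set()
snapshots = [
    "<div id='inbox'>  3 new messages </div>",
    "<div id='inbox'> 3 new messages </div>",   # same state, extra space
    "<div id='inbox'> 4 new messages </div>",   # genuinely new state
]
for snap in snapshots:
    fp = state_fingerprint(snap)
    if fp in seen:
        print("redundant state, skipping")
    else:
        seen.add(fp)
        print("new state, crawling")
```

Real AJAX crawlers use more forgiving comparisons (e.g. ignoring timestamps or ads in the DOM), but the seen-set structure is the same.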
There are many companies which maintain two interfaces for their web site: a dynamic interface for the users' convenience, and an alternate static view for crawlers. These solutions only discover the states and events of AJAX-based web content and ignore the web content behind AJAX forms. This research work aims to provide a solution for discovering the web content behind AJAX-based forms. Google has proposed a solution, but that project is still in progress [9]. The process of crawling the web behind AJAX applications becomes much more complicated when a form is encountered and the crawler needs to identify the domain of the form in order to fill in the data and crawl the page. Another problem is that no two forms have the same structure. For example, a user looking for a car finds a different kind of form than a user looking for a book. Hence there are different form schemas, which make reading and understanding forms more complicated. To make forms crawler-readable and understandable, the whole web would have to be classified into small categories, each category belonging to a different domain and each domain having a common form schema, which is not possible. There is another approach, the focused crawler. Focused crawlers try to retrieve only the subset of pages which contain the most relevant information for a particular topic. This approach leads to better indexing and more efficient searching than the first approach [17]. However, it will not work in situations where a form field depends on a parent field. For example, a student fills in a registration form: he/she enters a country name in one field, and the next combo box dynamically loads the city names of that particular country. To crawl the web behind AJAX forms, a crawler needs special functionality.

CRAWLING AJAX

Traditional web crawlers discover new web pages by starting from known web pages in a web directory. The crawler examines a web page, extracts new links (URLs) and then follows those links to discover new web pages.
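The form-filling step described above, where the crawler must guess plausible values like a human, is often bootstrapped with a simple heuristic: match field names against small per-domain value dictionaries. The dictionary and field names below are invented for illustration; real systems learn such values from query logs or sample submissions:

```python
# Sketch: heuristic form filling. Map each field name to a candidate
# value from a hand-made dictionary; unknown fields get a generic probe.
SAMPLE_VALUES = {
    "country": ["Pakistan", "Germany"],
    "city":    ["Lahore", "Berlin"],
    "year":    ["2009", "2010"],
}

def guess_values(field_names):
    """Return a candidate value for each recognised field name."""
    filled = {}
    for name in field_names:
        key = name.lower()
        matches = [k for k in SAMPLE_VALUES if k in key]
        filled[name] = SAMPLE_VALUES[matches[0]][0] if matches else "test"
    return filled

print(guess_values(["Country", "city_name", "comment"]))
# {'Country': 'Pakistan', 'city_name': 'Lahore', 'comment': 'test'}
```

Note that this flat lookup cannot handle the dependent-field case from the text (country determining the city list); handling that requires replaying the AJAX event that repopulates the child field after the parent is set.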
In other words, the whole web is a directed graph, and a crawler traverses that graph with a traversal algorithm [7]. As mentioned above, an AJAX-based web site behaves like a single-page application, so crawlers are unable to crawl the parts of the web that are AJAX-based. AJAX applications have a series of events and states: each event acts as an edge and each state acts as a node. Crawling these states has already been done in [14, 18], but that research left out the portion of the web which lies behind AJAX forms. The focus of this dissertation is to crawl the web behind AJAX forms.

INDEXING

Indexing means creating and managing an index of documents to make searching for and accessing the desired data easy and quick. Web indexing is about creating indexes for different web sites and HTML documents. These indexes are used by search engines to make their searching fast and efficient [19]. The major goal of any search engine is to build a database of large indexes. Indexes are based on organized information, such as topics and names, that serves as an entry point leading directly to the desired information within a corpus of documents [20]. If the web crawler's index only has space for so many web pages, then the stored pages should be the ones most relevant to the particular topic. A good web index can be maintained by extracting all relevant web pages from as many different servers as possible. A traditional web crawler takes the following approach: it uses a modified breadth-first algorithm to ensure that every server has at least one web page represented in the index. Every time the crawler encounters a new web page on a new server, it retrieves all of that server's pages and indexes them with relevant information for future use [7, 21]. The index contains the key words of each document on the web, with pointers to their locations within the documents. This index is called an inverted file. I have used this strategy to index the web behind AJAX forms.
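The inverted file just described can be sketched as a two-level mapping from each term to the documents, and positions within them, where it occurs. A tiny in-memory version, assuming documents are plain strings keyed by URL (the URLs and texts are made up):

```python
# Sketch of an inverted file: term -> {document URL -> [token positions]}.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {url: text}. Returns {term: {url: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][url].append(pos)
    return index

docs = {
    "http://a.example/page1": "deep web crawling",
    "http://b.example/page2": "web search engines index the web",
}
index = build_inverted_index(docs)
print(sorted(index["web"]))
# ['http://a.example/page1', 'http://b.example/page2']
```

Storing positions (not just document IDs) is what lets a query processor later support phrase queries and proximity ranking without rescanning the documents.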
QUERY PROCESSOR

The query processor processes the query entered by the user in order to match results from the index file. The user enters his/her request in the form of a query, and the query processor retrieves some or all of the links and documents from the index file that contain information related to the query, presenting them to the user as a list of results [7, 14]. This is a simple interface that can find relevant information with ease. Query processors are normally built over an index constructed breadth-first, which ensures that every server containing relevant information has many web pages represented in the index file [17]. This kind of design is important for users, as they can usually navigate within a single server more easily than across many servers. If a crawler identifies a server as containing useful data, users will probably be able to find what they are searching for there.

RESULT COLLECTION AND PRESENTATION

Search results are displayed to the user in the form of a list. The list contains the URLs and words that match the search query entered by the user. When the user submits a query, the query processor matches it against the index, finds the relevant matches and displays all of them on the result page [7]. Several result collection and presentation techniques are available; one of them is grouping similar web pages based on the rate of occurrence of particular key words across different web pages [15].

CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN

CHAPTER 4
EXPERIMENTS AND RESULTS

CHAPTER 5
FUTURE WORK

CHAPTER 6
CONCLUSION
