WebIC Research



While the World Wide Web contains a vast quantity of information, it is often difficult for web users to find the information they really want. The WebIC recommendation system suggests "information content" (IC) pages, i.e., pages the user must examine to accomplish her current task. WebIC uses the browsing behavior associated with the words in the user's session to locate these IC-pages anywhere on the Web.

 

How does WebIC differ from other web recommender systems?

Like other recommendation systems, WebIC watches a user as she navigates through a sequence of pages, and suggests pages that (it hopes) will provide relevant information. For more information about Web mining in general, please visit Web Mining Resource.

WebIC differs in several respects. First, as many recommendation systems are server-side, they can only provide information about one specific website, often based on correlations among the pages that previous users have visited. By contrast, WebIC is not specific to a single website, but can point users to pages anywhere on the Web. The fact that our intended coverage is the entire Web leads to a second difference: support.

As any single website has a relatively small number of pages, a website-specific recommendation system can expect many pages to have a large number of hits; it can therefore focus only on these high support (read "highly-visited") pages. Over the entire WWW, however, very few pages will have high support. WebIC uses a different approach to finding recommended pages, based on the users' abstract browsing patterns.

The third difference deals with the goal of the recommendation system: Many recommendation systems first determine other users that appear similar to the current user B, then recommend that B visit the pages that other similar users have visited. Unfortunately, there is no reason to believe that these correlated pages will contain information useful to B. Indeed, these suggested pages may correspond simply to irrelevant pages on the paths that others have taken towards their various goals, or worse, simply to standard dead-ends that everyone seems to hit. By contrast, our goal is to recommend only "information content" (IC) pages.
[Figure: correlation-based recommendation]

To determine which pages are IC-pages, we learn a user "Browsing Behavior Model" from a set of annotated web logs. Our learning algorithm learns to characterize how the user locates useful information on the Web, and we then apply this model to generate recommendations.

 

"Information Content" Pages

People may have several modes when using a web browser. One mode could be "Sunday shopping", where the user has no specific goal in mind and is just "killing time". We focus, however, on the situation where the user has some specific goal(s) in mind (e.g., deciding on a school for graduate work, deciding which used car to purchase, etc.); that is, her browsing is goal-directed, as she is seeking some specific information from the Web.

The user needs some specific information content (IC) to complete her specific task, e.g., a professor's research information for a graduate application, or pages that describe a product that the user may purchase (on-line shopping).

Our goal is to provide that IC. As Web content is organized into pages, this corresponds to identifying "Information Content" pages (IC-pages). We focus on the case where the content of a page arises from the words on the page. We therefore identify certain specific words with Information Content -- so-called "IC-words" -- and so seek pages that contain these words. So if we can predict the IC-words, then we can identify the IC-pages that match the predicted IC. Note that the Information Content (and hence the IC-words) is specific to the user's current session.
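The step from predicted IC-words to candidate IC-pages can be illustrated with a minimal sketch. It assumes, purely for illustration, that pages are ranked by how many predicted IC-words they contain; the function name `rank_pages`, the page identifiers, and the page texts are hypothetical and are not taken from WebIC itself.

```python
def rank_pages(ic_words, pages):
    """Rank pages by the number of predicted IC-words each one contains.

    `pages` maps a hypothetical page identifier to its plain text.
    """
    scored = []
    for page_id, text in pages.items():
        words_on_page = set(text.lower().split())
        scored.append((page_id, len(ic_words & words_on_page)))
    # Highest overlap first; ties keep their original order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up pages and IC-words, for illustration only.
pages = {
    "a": "whale migration routes",
    "b": "whale shirts on sale",
    "c": "football scores",
}
ranking = rank_pages({"whale", "migration"}, pages)
print(ranking)
```

A real system would need a richer match than raw word overlap; the sketch only shows how session-specific IC-words turn into a page ranking.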

Suppose a user starts a session at page p, whose content can be represented by a vector of words with some properties, such as how many times each word appears in the title, its number of occurrences in the page, etc. Either p is an IC-page, or the user takes an action (e.g., follow link, back up, type into field, ...) to search for her IC. Here, actions indirectly evaluate page content (e.g., p's words Pw), which in turn helps to evaluate the IC-ness of Pw. We can predict which words are or are not IC-words based on the action sequence applied to the pages that contain these words. In our research, we use browsing behaviour to generalize the action sequence applied to each word over the whole page sequence, such as how often the word appears in a title, how often the word appears in the anchor text of hyperlinks that were followed, etc.
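As a rough sketch of the "vector of words with some properties" mentioned above, the following computes two such properties per word: the number of occurrences in the title and in the body. The function name `word_properties`, the property names, and the simple tokenizer are assumptions made for this example, not WebIC's actual representation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_properties(title, body):
    """Map each word on a page to simple per-page properties."""
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    return {
        word: {
            "title_count": title_counts[word],  # 0 if absent from the title
            "body_count": body_counts[word],    # 0 if absent from the body
        }
        for word in title_counts.keys() | body_counts.keys()
    }


# Made-up page, for illustration only.
page = word_properties("Whale watching", "See a whale. Every whale tour departs daily.")
print(page["whale"])
```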

 

Browsing Behavior Models

We assume that people in general follow some very general rules to locate the information they are seeking, that is, not based on one web site or a specific set of words, but a model that describes how web users find useful information. If we can detect such patterns, and use them to predict a web user's current information need, we may provide useful content recommendations. Since our model is not based on one specific web site or a specific set of words, it can be applied in many different circumstances. While there may be important inter-individual differences in users' search behavior, we anticipate that our model - due to its high level of abstraction - will be able to capture general behavioral patterns that apply to broad classes of users and environments.

Browsing Behavior

Page-Action Sequence

The original page sequence can be represented by a page-action sequence, which is a sequence of {..., (pi, ai), ...} where pi is page i in the session and ai is the action applied to that page by the user, such as "follow-up", "back-off", "forward", etc. We infer the user's attitude towards the content of the pages from the actions the user applies to the pages. For the example (Show Me), the page-action sequence is


[Figure: Page-Action Sequence]
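A page-action sequence of the form {..., (pi, ai), ...} could be held in a structure like the one sketched here. The three action names come from the text; the class names, the page identifiers, and the sample session are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FOLLOW_UP = "follow-up"
    BACK_OFF = "back-off"
    FORWARD = "forward"


@dataclass(frozen=True)
class PageAction:
    page: str       # hypothetical page identifier (e.g., a URL)
    action: Action  # action the user applied to that page


# Made-up session, for illustration only.
session = [
    PageAction("search-results", Action.FOLLOW_UP),
    PageAction("whale-shirts", Action.BACK_OFF),
    PageAction("search-results", Action.FOLLOW_UP),
    PageAction("whale-migration", Action.FOLLOW_UP),
]


def pages_with_action(sequence, action):
    """Return the pages, in visit order, to which `action` was applied."""
    return [step.page for step in sequence if step.action is action]


backed_off = pages_with_action(session, Action.BACK_OFF)
print(backed_off)
```

Keeping the action next to the page makes it straightforward to ask later which pages the user backed away from, which is the kind of signal the model relies on.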

Word Role-Action Sequence

The content of the page is communicated by the roles that words play on the page. We make the simplifying assumption that we can represent the information need of the session by a set of significant words from the session. Recall that we call these significant words information content words (IC-words). We further assume that the significance of words can be judged independently for each word from the roles played by instances of the word on pages throughout the sequence (e.g., appearing as plain text, highlighted in a title, appearing in link anchor text, etc.) and the actions the user implicitly applies to the page containing the instances. We define the role-action sequence for word w to be a vector of the roles played by word w in each of its instances, together with the action applied to the page containing that instance.

Before we deployed the WebIC system, we first ran a controlled study, where subjects explicitly indicated which pages were IC-pages. (We are running additional studies, to obtain yet more information...) (Note that these IC-page bits could also be determined during earlier sessions of this specific user.) For each word, we also know whether it is an IC-word or not, i.e., whether it appears in an IC-page. For example, the role-action sequence of "whale" in the example (Show Me) would be


[Figure: Role-Action Sequence]
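To make the definition concrete, here is a minimal sketch that builds a role-action sequence for one word. The role labels ("text", "title", "anchor"), the `PageVisit` type, and the sample visits are assumptions for illustration; they do not reproduce the example linked above.

```python
from typing import NamedTuple


class PageVisit(NamedTuple):
    roles: dict   # word -> set of roles the word plays on this page
    action: str   # action the user applied to this page


def role_action_sequence(word, visits):
    """List (roles, action) for every visited page on which `word` appears."""
    return [
        (sorted(visit.roles[word]), visit.action)
        for visit in visits
        if word in visit.roles
    ]


# Made-up visits, for illustration only.
visits = [
    PageVisit({"whale": {"anchor", "text"}, "shop": {"anchor"}}, "follow-up"),
    PageVisit({"whale": {"title"}, "shirt": {"text"}}, "back-off"),
    PageVisit({"migration": {"title"}}, "follow-up"),
]
whale_sequence = role_action_sequence("whale", visits)
print(whale_sequence)
```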
We would like to characterize the significance of the role played by word w in terms of a compact set of features. In general, a feature is a function of a role-action sequence. For instance, we might want to define a feature, f_c(w), that specifies the number of times the word w appears in a user's session. Another feature, f_f(w), specifies the percentage of times the user followed a link whose anchor included w. The browsing features describe how the user treats the information when she is trying to locate useful information on the Web. Here are the browsing features for the words encountered while searching for "whale" on the Web.
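The two features named above, f_c(w) and f_f(w), might be computed along the following lines. This is a sketch under stated assumptions: the `Visit` structure is hypothetical, and f_f(w) is read here as the percentage of followed links whose anchor text included w, which is one plausible interpretation of the definition.

```python
from dataclasses import dataclass, field


@dataclass
class Visit:
    words: list  # every word instance on the page
    followed_anchor: list = field(default_factory=list)  # anchor words of the link followed, if any


def f_c(word, session):
    """Number of times `word` appears across the session."""
    return sum(visit.words.count(word) for visit in session)


def f_f(word, session):
    """Percentage of followed links whose anchor text included `word`."""
    followed = [visit for visit in session if visit.followed_anchor]
    if not followed:
        return 0.0
    hits = sum(1 for visit in followed if word in visit.followed_anchor)
    return 100.0 * hits / len(followed)


# Made-up session, for illustration only.
session = [
    Visit(["whale", "watching", "whale", "tours"], ["whale", "tours"]),
    Visit(["whale", "shirts", "sale"]),
    Visit(["tours", "schedule", "whale"], ["schedule"]),
]
print(f_c("whale", session), f_f("whale", session))
```

Both functions depend only on how the word was encountered and acted upon, not on the word's identity, which is what lets the learned model transfer across sessions and topics.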

 

IC-word Model

For each word, we extract its browsing features, and also a bit to indicate whether it is an IC-word or not. Notice that the word itself is irrelevant; the important characteristics are its browsing properties, which reflect how the user dealt with the word in the current session. For example, we would expect to find rules of the form

any word that appears in the followed hyperlinks,
but never in the content of backed-up page,
is likely to be an IC-word

Notice this rule does not talk about "whale" or "football".
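Written as code, a rule of this form is a predicate over browsing features alone. The feature names below are hypothetical stand-ins for whatever features the learner actually uses.

```python
def likely_ic_word(features):
    """Example rule: in a followed hyperlink, never on a backed-up page."""
    return (
        features["followed_anchor_count"] > 0
        and features["backed_up_page_count"] == 0
    )


# Made-up feature values, for illustration only; no word identity is involved.
seen_in_followed_links_only = {"followed_anchor_count": 2, "backed_up_page_count": 0}
seen_on_backed_up_page = {"followed_anchor_count": 1, "backed_up_page_count": 3}

print(likely_ic_word(seen_in_followed_links_only))
print(likely_ic_word(seen_on_backed_up_page))
```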

Given a set of annotated web-logs, we can use a learning algorithm to produce a classifier. The empirical accuracy of the resulting classifiers, listed below, suggests that our approach and our algorithms work effectively.

Classifier              Accuracy
Decision Tree (C4.5)    87.4%
Naive Bayes             78.3%
Ripper                  73.7%
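For readers who want to see the shape of such a learner, the following is a minimal Bernoulli Naive Bayes sketch over two binary browsing features. It is not the WebIC implementation: the two features, the six training rows, and the labels are invented toy data, and the sketch makes no attempt to reproduce the accuracies reported above.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Fit a Bernoulli Naive Bayes model with add-one smoothing."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    on_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            on_counts[label][j] += value
    model = {}
    for label, count in class_counts.items():
        log_prior = math.log(count / len(rows))
        probs = [(on + 1) / (count + 2) for on in on_counts[label]]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the label with the highest log-posterior for `row`."""
    def score(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p if value else 1 - p) for p, value in zip(probs, row)
        )
    return max(model, key=score)


# Toy data (invented): (in_followed_anchor, on_backed_up_page) -> is IC-word?
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = [True, True, False, False, False, False]
model = train_naive_bayes(rows, labels)

print(predict(model, (1, 0)), predict(model, (0, 1)))
```

The training rows contain only browsing features and an IC-word bit, mirroring the point that the word itself is irrelevant to the classifier.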