While the World Wide Web contains a vast quantity of information, it is often difficult for web users to find the information they really want. The WebIC recommendation system suggests "information content" (IC) pages, i.e., pages the user must examine to accomplish her current task. WebIC uses the browsing behavior associated with the words on visited pages to locate these IC-pages anywhere on the Web.
Like other recommendation systems, WebIC watches a user as she navigates through a sequence of pages, and suggests pages that (it hopes) will provide relevant information. For more information about Web mining in general, please visit Web Mining Resource. WebIC differs from other recommendation systems in several respects.
First, as many recommendation systems are server-side,
they can only provide information about
one specific website, often based on correlations amongst the pages
that previous users have visited.
By contrast, WebIC is not specific to a single website,
but can point users to pages anywhere in the Web.
The fact that our intended coverage is the entire Web leads to a second
difference: support. As any single website has a relatively small number of pages,
a website-specific recommendation system can expect many pages
to have a large number of hits;
it can therefore focus only on these high support
(read "highly-visited") pages.
Over the entire WWW, however, very few pages will have high support.
WebIC uses a different approach to finding recommended pages,
based on the users' abstract browsing patterns.
The third difference concerns the goal of the recommendation system:
Many recommendation systems
first determine other users that appear similar to the current user B,
then recommend that B visit the pages that other similar users
have visited.
Unfortunately, there is no reason to believe that these correlated pages will
contain information useful to B.
Indeed, these suggested pages may correspond simply to irrelevant
pages on the paths that others have taken towards their various
goals, or worse, simply to standard dead-ends that everyone seems
to hit.
By contrast, our goal is to recommend only
"information content" (IC) pages.
To determine which pages these are,
we learn a "Browsing Behavior Model" from a set of
annotated web logs: our learning algorithm learns to characterize
how users locate useful information on the Web, and we then apply this model
to generate recommendations.
People may have several modes when using a web browser.
One mode could be "Sunday shopping", where the user has no specific goal in mind
and is just "killing time".
We focus, however, on the situation where the user has some specific goal(s)
in mind (e.g., deciding on a school for graduate work,
deciding which used car to purchase, etc.); that is,
her browsing is goal-directed, as she is seeking some specific information
from the Web. The user needs some specific information content (IC) to complete her specific task, e.g., a professor's research information for a graduate application, or pages that describe a product the user wants to purchase (on-line shopping). Our goal is to provide that IC. As Web content is organized into pages, this corresponds to identifying "Information Content" pages (IC-pages). We focus on the case where the content of a page arises from the words on the page. We therefore identify certain specific words with Information Content, the so-called "IC-words", and seek pages that contain these words. So if we can predict the IC-words, then we can identify the IC-pages that match the predicted IC. Note that the Information Content (and hence the set of IC-words) is specific to the user's current session.
Suppose a user starts a session at page p,
whose content can be represented by a vector of words with some properties, such as
how many times each word appears in the title, its number of occurrences in the page, etc.
Either p is an IC-page, or the user takes an action
(e.g., follow link, back up, type into field, ...)
to search for her IC. Here, actions indirectly evaluate page content
(e.g., p's words Pw),
which consequently helps to evaluate the IC-ness of Pw.
We can predict which words are or are not IC-words based on the sequence of actions
applied to the pages that contain these words. In our research, we use
browsing behaviour to generalize the action sequence applied to each word
across the whole page sequence, such as how often the word appears in a title, how often
the word appears in the anchor text of hyperlinks that were followed, etc.
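As a rough illustration of representing a page by a vector of words with properties, the following minimal sketch counts, for each word, how often it appears in the title, in the body, and in link anchor text. The `Page` class, the `tokenize` helper, and the property names are assumptions made for this example; they are not taken from the WebIC implementation.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical minimal page: title, body text, and link anchor texts."""
    title: str
    body: str
    anchors: list


def tokenize(text):
    # Lower-case the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def word_properties(page):
    """Map each word on the page to a small dictionary of per-page counts."""
    title_counts = Counter(tokenize(page.title))
    body_counts = Counter(tokenize(page.body))
    anchor_counts = Counter(w for a in page.anchors for w in tokenize(a))
    words = set(title_counts) | set(body_counts) | set(anchor_counts)
    return {
        w: {
            "in_title": title_counts[w],    # occurrences in the title
            "in_body": body_counts[w],      # occurrences in the page text
            "in_anchor": anchor_counts[w],  # occurrences in link anchor text
        }
        for w in words
    }


page = Page(
    title="Whale watching",
    body="Whale tours: see a whale up close",
    anchors=["whale photos"],
)
props = word_properties(page)
print(props["whale"])
```

A real system would likely track more properties (emphasis, position, and so on); the three counts here are only meant to show the shape of the representation.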
We assume that people in general follow some very general rules to locate the information they are seeking; that is, rules that do not depend on one web site or a specific set of words, but that together form a model of how web users find useful information. If we can detect such patterns, and use them to predict a web user's current information need, we may provide useful content recommendations. Since our model is not based on one specific web site or a specific set of words, it can be applied in many different circumstances. While there may be important inter-individual differences in users' search behavior, we anticipate that our model, due to its high level of abstraction, will be able to capture general behavioral patterns that apply to broad classes of users and environments.
Page-Action Sequence
The original page sequence can be represented by a page-action sequence, which is a sequence of {..., (pi, ai), ...} where pi is page i in the session and ai is the action applied to that page by the user, such as "follow-up", "back-off", "forward", etc. We infer the user's attitude towards the content of the pages from the actions the user applies to the pages. For the example (Show Me), the page-action sequence is
Word Role-Action Sequence
The content of the page is communicated by the roles that words play on the page. We make the simplifying assumption that we can represent the information need of the session by a set of significant words from the session. Recall that we call these significant words information content words (IC-words). We further assume that the significance of each word can be judged independently, from the roles played by instances of the word on pages throughout the sequence (e.g., appearing as plain text, highlighted in a title, appearing in link anchor text, etc.) and the actions the user implicitly applies to the pages containing those instances. We define the role-action sequence for word w to be a vector of the roles played by w in each of its instances, together with the action applied to the page containing that instance. Before we deployed the WebIC system, we first ran a controlled study, where subjects explicitly indicated which pages were IC-pages. (We are running additional studies, to obtain yet more information...) (Note these IC-page bits could also be determined during earlier sessions of this specific user.) For each word, we also know whether it is an IC-word or not, i.e., whether it appears in an IC-page. For example, the role-action sequence of "whale" in the example (Show Me) would be
For each word, we extract its browsing features, and also a bit to indicate
whether it is an IC-word or not.
Notice the word itself is irrelevant; the important characteristics are
its browsing properties, which reflect how the user dealt with the word in
the current session.
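The step from a page-action sequence to per-word browsing features might be sketched as follows. Everything here is hypothetical: the two-page session, the role names ("title", "text", "anchor"), the action names, the three features, and the set of IC-page words are all invented for illustration. The sketch also simplifies by treating every anchor word on a page as "followed" whenever the user followed some link on that page.

```python
from collections import defaultdict

# Hypothetical session: a list of (page, action) pairs, where each page is
# sketched as {role: [words playing that role]} and the action is the one
# the user applied to that page.
session = [
    ({"title": ["marine", "mammals"],
      "text": ["whale", "dolphin"],
      "anchor": ["whale", "seal"]}, "follow"),
    ({"title": ["whale"],
      "text": ["whale", "migration"],
      "anchor": ["home"]}, "back"),
]


def role_action_sequences(session):
    """For each word, collect (role, action) for every instance in the session."""
    seqs = defaultdict(list)
    for page, action in session:
        for role, words in page.items():
            for word in words:
                seqs[word].append((role, action))
    return dict(seqs)


def browsing_features(seq):
    """Summarize a role-action sequence as word-independent fractions."""
    n = len(seq)
    return {
        "frac_title": sum(r == "title" for r, _ in seq) / n,
        "frac_anchor_followed":
            sum(r == "anchor" and a == "follow" for r, a in seq) / n,
        "frac_backed": sum(a == "back" for _, a in seq) / n,
    }


seqs = role_action_sequences(session)

# Assumed annotation: words appearing on a page the subject marked as an IC-page.
ic_page_words = {"whale", "migration"}

# One training example per word: (browsing features, IC-word bit).
examples = {w: (browsing_features(s), w in ic_page_words) for w, s in seqs.items()}
print(examples["whale"])
```

Because the features are fractions over roles and actions, the resulting examples carry no trace of the word itself, which is what allows a model learned on one set of sessions to be applied to sessions with entirely different vocabulary.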
For example, we would expect to find rules of the form
"any word that appears in the followed hyperlinks, ...". Given a set of annotated web-logs, we can use a learning algorithm to produce a classifier. This empirical evidence (the accuracy of the learned classifier) shows that our approach, and our algorithms, work effectively.
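To make the learning step concrete, here is a minimal sketch that learns a one-rule classifier (a decision stump) from labeled browsing features and then uses the predicted IC-words to score candidate pages. The decision stump is a stand-in chosen for brevity, not the learning algorithm WebIC uses; the feature names, training examples, and scoring function are likewise assumptions made for this example.

```python
def train_stump(examples):
    """Find the single rule 'feature >= threshold' with the best training accuracy.

    examples: list of (features: dict of str -> float, is_ic_word: bool).
    Returns (feature_name, threshold, training_accuracy).
    """
    best = None
    for name in sorted(examples[0][0]):
        for threshold in sorted({f[name] for f, _ in examples}):
            correct = sum((f[name] >= threshold) == label for f, label in examples)
            accuracy = correct / len(examples)
            if best is None or accuracy > best[2]:
                best = (name, threshold, accuracy)
    return best


def predict(stump, features):
    """Apply the learned rule to the browsing features of a new word."""
    name, threshold, _ = stump
    return features[name] >= threshold


def score_page(page_words, ic_words):
    """Fraction of the predicted IC-words that occur on a candidate page."""
    if not ic_words:
        return 0.0
    return len(ic_words & set(page_words)) / len(ic_words)


# Hypothetical annotated examples: (browsing features, IC-word bit).
training = [
    ({"frac_title": 0.5, "frac_anchor_followed": 0.5}, True),
    ({"frac_title": 0.0, "frac_anchor_followed": 0.25}, True),
    ({"frac_title": 0.5, "frac_anchor_followed": 0.0}, False),
    ({"frac_title": 0.0, "frac_anchor_followed": 0.0}, False),
]
stump = train_stump(training)

# Hypothetical words from a new session, with their browsing features.
new_session = {
    "whale": {"frac_title": 0.25, "frac_anchor_followed": 0.5},
    "home": {"frac_title": 0.0, "frac_anchor_followed": 0.0},
}
ic_words = {w for w, f in new_session.items() if predict(stump, f)}
print(stump, ic_words, score_page(["whale", "migration", "route"], ic_words))
```

The learned rule refers only to browsing features, never to particular words, so it has the same general form as the rule quoted above. Accuracy here is measured on the training examples only; a real evaluation would use held-out sessions.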