RSS punter

RSS—RDF Site Summary [1] or alternatively Really Simple Syndication—is an XML format used for describing the content of a web site, where that site typically contains news items, diary entries, event information or generally anything that grows, item by item, over time. A classic application of RSS is to describe a news site such as JabberCentral. [2] JabberCentral's main page (see Figure 8-5) consists of a number of news items—in the "Recent News" section—about Jabber and the community (what else?). These items appear in reverse chronological order, and each one is fairly succinct, sharing a common set of properties:

Title

Each item has a title ("JabberCon Update 11:45am - Aug 20").

Short description

For each item, there's a short piece of text describing the content and context of the news story ("JabberCon Update - Monday Morning").

Link to main story

The short description should be enough to help the reader decide if he wants to read the whole item. If he does, there's a link ("Read More") to the news item itself.

Figure 8-5. JabberCentral's main page

It is this collection of item-level properties that is summarized in an RSS file. The formality of the XML structure makes it straightforward to automate the retrieval of story summaries for inclusion in other sites (syndication), to combine these items with items from other similar sources (aggregation), and simply to check whether there is any new content (new items) since the last visit.

Example 8-6 shows what the RSS XML for the JabberCentral news items shown in Figure 8-5 looks like.

Example 8-6. RSS source for JabberCentral

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
     "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">

  <channel>

    <title>JabberCentral</title>

    <description>
      JabberCentral is the premiere Jabber end-user news and support
      site. Many Jabber developers are actively involved at JabberCentral
      to provide fresh and authoritative information for users.
    </description>

    <language>en-us</language>
    <link>http://www.jabbercentral.com/</link>
    <copyright>Copyright 2000, Aspect Networks</copyright>

    <image>
      <url>http://jabbercentral.com/images/jc_button.gif</url>
      <title>JabberCentral</title>
      <link>http://www.jabbercentral.com/</link>
    </image>
  
    <item>
      <title>JabberCon Update 11:45am - Aug 20</title>
      <link>http://www.jabbercentral.com/news/view.php?news_id=998329970</link>
      <description>JabberCon Update - Monday Morning</description>
    </item>
  
    <item>
      <title>Jabcast Promises Secure Jabber Solutions</title>
      <link>http://www.jabbercentral.com/news/view.php?news_id=998061331</link>
      <description>
        Jabcast announces their intention to release security
        plugins with their line of products and services.
      </description>
    </item>

    ... (more items) ...

  </channel>

</rss>

The structure is very straightforward. Each RSS file describes a "channel" (<channel/>). Here's how a channel is defined:

Channel Information

The channel in this case is JabberCentral. The channel header information includes the channel's title (<title/>), short description (<description/>), main URL (<link/>) and so on.

Channel Image

Often RSS information is rendered into HTML to provide a concise "current index" summary of the channel it describes. An image can be used in that summary rendering, and its definition is held in the <image/> section of the file.

Channel Items

The bulk of the RSS file content is made up of the individual <item/> sections, each of which reflects an item on the site that the channel represents. We can see in Example 8-6 that the first <item/> tag:

<item>
  <title>JabberCon Update 11:45am - Aug 20</title>
  <link>http://www.jabbercentral.com/news/view.php?news_id=998329970</link>
  <description>JabberCon Update - Monday Morning</description>
</item>

describes the most recent news item shown on JabberCentral's main page, "JabberCon Update 11:45am - Aug 20". Each of the news item properties is contained within that <item/> tag: the title (<title/>), short description (<description/>), and link to main story (<link/>).

Channel Interactive Feature

There is a possibility for each channel to describe an interactive feature on the site it represents; often this is a search engine which is fronted by a text input field and submit button. The interactive feature section of an RSS file is used to describe how that mechanism is to work (the name of the input field and the submit button, and the URL to invoke when the button is pressed, for example). This is so that HTML renderings of the site can include the feature otherwise only available on the original site.

This interactive feature definition is not shown in our RSS example.
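To make the channel and item structure concrete, here is a minimal Python sketch that pulls the title, link, and description out of each <item/>. It is only an illustration, not part of the recipe's script: the sample is an abridged copy of Example 8-6 (XML declaration, DOCTYPE, and image omitted), and the names RSS_SAMPLE and parse_items are made up for this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Example 8-6, for illustration only.
RSS_SAMPLE = """<rss version="0.91">
  <channel>
    <title>JabberCentral</title>
    <link>http://www.jabbercentral.com/</link>
    <item>
      <title>JabberCon Update 11:45am - Aug 20</title>
      <link>http://www.jabbercentral.com/news/view.php?news_id=998329970</link>
      <description>JabberCon Update - Monday Morning</description>
    </item>
    <item>
      <title>Jabcast Promises Secure Jabber Solutions</title>
      <link>http://www.jabbercentral.com/news/view.php?news_id=998061331</link>
      <description>
        Jabcast announces their intention to release security
        plugins with their line of products and services.
      </description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_xml):
    """Return the channel title and a list of item dicts, in document order."""
    channel = ET.fromstring(rss_xml).find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            # Collapse the whitespace that pretty-printed feeds put around text.
            name: " ".join((item.findtext(name) or "").split())
            for name in ("title", "link", "description")
        })
    return channel.findtext("title"), items


title, items = parse_items(RSS_SAMPLE)
print(title, len(items))
print(items[0]["title"], "->", items[0]["link"])
```

Because RSS lists items newest first, items[0] is the most recent story on the site.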

RSS information lends itself very well to various methods of viewing. There are custom "headline viewer" clients available—focused applications that allow you to select from a vast array of RSS sources and have links to items displayed on your desktop (so yes, the personal newspaper—of sorts—is here!). There are also possibilities for having RSS items scroll by on your desktop control bar.

And then there's Jabber. As described in the section called The Message Element in Chapter 5, the Jabber <message/> element can represent something that looks suspiciously like an RSS item. The message type "headline" defines a message that carries news headline information. In this case, the <message/> element itself is usually embellished with an extension, qualified by the jabber:x:oob namespace, which is described in the section called jabber:x:oob in Chapter 5a. If the first news item from the JabberCentral site were to be carried in a headline message, Example 8-7 shows what the element would look like.

Example 8-7. A headline message carrying a JabberCentral news item

<message type='headline' to='dj@qmacro.dyndns.org'>
  <subject>JabberCon Update 11:45am - Aug 20</subject>
  <body>JabberCon Update - Monday Morning</body>
  <x xmlns='jabber:x:oob'>
    <url>http://www.jabbercentral.com/news/view.php?news_id=998329970</url>
    <desc>JabberCon Update - Monday Morning</desc>
  </x>
</message>

It's the extension, qualified by the jabber:x:oob namespace, that carries the crucial parts of the RSS item. And there are off the shelf clients such as WinJab and Jarl that can understand this extension and do something useful if a headline type message is received: the item content is displayed in a clickable list of lines, each one representing a single RSS item, akin to the headline viewer clients mentioned earlier.

Of course, we could simply punt RSS items to clients in non-headline type messages:

<message to='dj@qmacro.dyndns.org'>
  <subject>JabberCon Update 11:45am - Aug 20</subject>
  <body>
    JabberCon Update - Monday Morning
    http://www.jabbercentral.com/news/view.php?news_id=998329970
  </body>
</message>

where the complete item information is transmitted in a combination of the <subject/> and <body/> tags. This works too, but has the disadvantage of presenting no context within which the URL can be interpreted by the receiving clients, meaning all they can do with the message is display it as they would display any other message. The key is that we can send formalized metadata which increases the value of our message content enormously. Figure 8-7 shows Jarl displaying RSS-sourced news headlines.
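The mapping from an RSS item to the headline message of Example 8-7 can be sketched in a few lines of Python. This is a hedged illustration rather than the recipe's own code: the function name headline_message and the dictionary keys are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# One RSS item, using the values from Example 8-6.
ITEM = {
    "title": "JabberCon Update 11:45am - Aug 20",
    "link": "http://www.jabbercentral.com/news/view.php?news_id=998329970",
    "description": "JabberCon Update - Monday Morning",
}


def headline_message(to_jid, item):
    """Build a type='headline' <message/> carrying one RSS item."""
    message = ET.Element("message", {"type": "headline", "to": to_jid})
    ET.SubElement(message, "subject").text = item["title"]
    ET.SubElement(message, "body").text = item["description"]
    # Writing xmlns as a plain attribute keeps the output free of
    # generated prefixes such as ns0:.
    oob = ET.SubElement(message, "x", {"xmlns": "jabber:x:oob"})
    ET.SubElement(oob, "url").text = item["link"]
    ET.SubElement(oob, "desc").text = item["description"]
    return ET.tostring(message, encoding="unicode")


print(headline_message("dj@qmacro.dyndns.org", ITEM))
```

The title goes into <subject/>, the description into both <body/> and <desc/>, and the link into the <url/> of the jabber:x:oob extension, which is the part a headline-aware client can act on.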

Distributing RSS-sourced headlines over Jabber to standard Jabber clients is a great combination of off the shelf technologies. In fact, we'll see in the next section that it's not just standard Jabber clients that fit the bill; we'll write a Jabber-based headline viewer to show that not all Jabber clients are, nor should they be, made equal. But anyway, let's get to it!

The plan

We're going to write an RSS punter: a mechanism that checks predefined sources for new RSS items, and punts (or pushes) them to people who are interested in receiving them. For the sake of simplicity, we'll define the list of RSS sources in the script itself. See the section called Further ideas for ideas on how to develop this recipe further.

The script as component

Until now, the recipes we've written (such as the CVS Notification, the Dialup System Watch, and the Keyword Assistant, all in Chapter 7) have all existed as Jabber clients. That is, they've performed a service while connected to the Jabber server via the JSM (Jabber Session Manager). There's nothing wrong with this; indeed, it's more than just fine to build Jabber-based mechanisms using a Jabber client stub connection. That way, your script, through its identity (the user JID), can avail itself of all the IM-related functions that the JSM offers: presence, storage and forwarding of messages, and so on. Perhaps even more interesting is that the mechanism needs only an account, a username and password, on a Jabber server to be part of the big connected picture. One can look at this sort of client-connected mechanism as an effective and low-cost entry to building the 'A's in our Jabber-connected A2A, A2P and P2A world of acronyms (or "acronym-qualified worlds"; you decide).

However, we know from Chapter 4 that there are other entities that connect to Jabber to provide services. These entities are called components. You can look upon components as being philosophically less "transient" than their client-connected brethren; and also closer to the Jabber server in terms of function and connection.

We know from the section called jabberd and Components in Chapter 4 that there are various ways to connect a component: Library load, STDIO, and TCP sockets. The first two dictate that the component be located on the same host as the jabberd backbone to which it connects. [3] The TCP sockets connection type, a method using a socket connection between the component and the jabberd backbone, over which streamed XML documents are exchanged (in the same way as they are exchanged in a client connection), allows us to run components on any host, and connect them to a Jabber server running on another host if we wish. Because of the connection flexibility, this approach is in many ways the most desirable. But it's not just the flexibility; because it abstracts the component away from the Jabber server core libraries, it leaves it up to us to decide how the component should be written. All the component has to do to get the Jabber server to cooperate is to establish the socket connection as described in the component instance configuration, perform an authenticating handshake, and correctly exchange XML stream headers.

Let's review how a TCP socket-based component connects. We'll base the review on what we're actually going to have to do to get our RSS punter up and running.

First, we have to tell the Jabber server that it is to expect an incoming socket connection attempt, which it is to accept. We do this by defining a component instance definition (or "description"—see the section called Component instances in Chapter 4) for our component. We include this definition in the main Jabber server configuration file, usually called jabber.xml. Example 8-8 shows a component instance definition for our RSS punter mechanism, known as rss.qmacro.dyndns.org.

Example 8-8. A component instance definition for our RSS punter mechanism

<service id='rss.qmacro.dyndns.org'>
  <accept>
    <ip>localhost</ip>
    <port>5999</port>
    <secret>secret</secret>
  </accept>
</service>

The name of the host on which the main Jabber server is running is qmacro.dyndns.org; it just so happens that our plan is to run the RSS punter component on the same host. We give it a unique name (rss.qmacro.dyndns.org) to enable the jabberd backbone, or hub, to distinguish it from other components and to be able to route elements to it. An alternative way of writing the component instance definition is shown in Example 8-9. The difference is simply in the way we specify the name. In Example 8-8 we specified an id in the <service/> tag with the value "rss.qmacro.dyndns.org". In the absence of any <host/> tag specification in the definition, this id value is used by the jabberd routing logic as the identification for the component when determining where elements addressed with that destination should be sent. In Example 8-9, we have an explicit <host/> specification which will be used instead, and we simply identify the service with an id attribute value of "rss". In this latter case, it doesn't really matter from an addressability point of view what we specify as the value for the id attribute.

Example 8-9. An alternative instance definition for our RSS punter mechanism

<service id='rss'>
  <host>rss.qmacro.dyndns.org</host>
  <accept>
    <ip>localhost</ip>
    <port>5999</port>
    <secret>secret</secret>
  </accept>
</service>

The instance definition contains all the information the Jabber server needs. We can tell from the <accept/> tag that this definition describes a TCP sockets connection. The socket connection detail is held in the <ip/> and <port/> tags. In this case, as we're going to run the RSS punter component on the same host as the Jabber server itself, we might as well kill two related birds with one stone by specifying localhost in the <ip/> tag: [4]

Performance

Connecting over the loopback device, as opposed to a real network interface, will give us a slight performance boost.

Security

Accepting only on the loopback device is a simple security measure that leaves one less port open to the world.

The <secret/> tag holds the secret that the connecting component must present in the authentication handshake.

Now let's look at the component's view of things. It will need to establish a socket connection to 127.0.0.1:5999. Once that connection has been established, jabberd will be expecting it to announce itself by sending its XML document stream header. Example 8-10 shows a typical stream header that our component will need to send.

Example 8-10. The RSS component's stream header

SEND: <?xml version='1.0'?>
      <stream:stream xmlns='jabber:component:accept' 
                     xmlns:stream='http://etherx.jabber.org/streams'
                     to='localhost'>

This matches the description of a Jabber XML stream header (also known as a stream "root" as it's the root tag of the XML document) from the section called XML Streams in Chapter 5. The namespace that is specified as the one qualifying the content of the stream is jabber:component:accept. This namespace "matches" the component connection method (TCP sockets) and the significant tag name in the component instance definition (<accept/>). [5] The value specified in the to attribute matches the hostname specified in the configuration's <ip/> tag.
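As a rough Python sketch of this first step, the component opens the socket and writes the stream header of Example 8-10. The function names stream_header and open_stream are invented for this illustration; the address, port, and to value are the ones from the component instance definition and the example above.

```python
import socket

STREAM_NS = "http://etherx.jabber.org/streams"


def stream_header(to_host):
    """Return the opening tag of a jabber:component:accept stream."""
    return (
        "<?xml version='1.0'?>"
        "<stream:stream xmlns='jabber:component:accept' "
        f"xmlns:stream='{STREAM_NS}' to='{to_host}'>"
    )


def open_stream(ip="127.0.0.1", port=5999, to_host="localhost"):
    """Connect to jabberd and send the header; return the open socket."""
    sock = socket.create_connection((ip, port))
    sock.sendall(stream_header(to_host).encode("utf-8"))
    return sock


print(stream_header("localhost"))
```

Note that the root tag is deliberately left open: the stream is one long XML document that is only closed when the conversation ends.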

After receiving a valid stream header, jabberd responds with a similar root to head up its own XML document stream going in the opposite direction (from server to component). A typical response to the header in Example 8-10, received from the server by the component, is shown in Example 8-11.

Example 8-11. The server's stream header reply

RECV: <?xml version='1.0'?>
      <stream:stream xmlns:stream='http://etherx.jabber.org/streams'
                     id='3B8E3540'
                     xmlns='jabber:component:accept'
                     from='rss'>

The stream header sent in response shows that the server is confirming the component instance's identification as "rss". This reflects whatever was specified in the <service/> tag's id attribute of the component instance definition. Here, the value of the id attribute was "rss" as in Example 8-9.

It also contains an ID for the component instance itself: id='3B8E3540' in our example. The significance of this ID is as a random string shared between both connecting parties; the value is used in the next stage of the connection attempt—the authenticating handshake.
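Because the server's stream root is never closed at this point, the component cannot wait for a complete document before reading the ID. One way to get at it, sketched here in Python under the assumption that the whole header arrives in one read, is an incremental parser that reports start tags as they are seen. The name read_stream_id is made up for this example.

```python
import xml.etree.ElementTree as ET

STREAM_TAG = "{http://etherx.jabber.org/streams}stream"

# The server's reply from Example 8-11.
SERVER_HEADER = """<?xml version='1.0'?>
<stream:stream xmlns:stream='http://etherx.jabber.org/streams'
               id='3B8E3540'
               xmlns='jabber:component:accept'
               from='rss'>"""


def read_stream_id(data):
    """Return the id attribute of the stream root, or None if not seen yet."""
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(data)
    for _event, element in parser.read_events():
        if element.tag == STREAM_TAG:
            return element.get("id")
    return None


stream_id = read_stream_id(SERVER_HEADER)
print(stream_id)
```

A real component would keep the same parser alive and continue feeding it whatever arrives on the socket, since everything that follows is a child of this root.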

The section called Digest Authentication Method in Chapter 6 describes the digest method of authentication for clients connecting to the JSM. That method uses a similar shared random string. On receipt of the server's stream header, the component takes the ID and prepends it onto the secret that it must authenticate itself with. It then creates a NIST-SHA1 message digest (in a hexadecimal format) of that value:

SHA1_HEX(ID+SECRET)

Having created the digest, it sends it as the first XML fragment following the root, in a <handshake/> element:

SEND: <handshake id='1'>14d437033d7735f893d509c002194be1c69dc500</handshake>

On receipt of this authentication request, jabberd does the same thing: combines the ID value (after all, it knows what it is as it was jabberd that generated the value) with the value from the <secret/> tag in the component instance definition, and performs the same digest algorithm. If the digests match, the component is deemed to have authenticated itself correctly, and is sent back an empty <handshake/> tag in confirmation:

<handshake/>

The component may commence sending (and being sent) elements.

If the component sends an invalid handshake value—the secret may be wrong, or the digest may not have been calculated correctly—the connection is closed: jabberd sends a stream error and therewith ends the conversation:

RECV: <stream:error>Invalid handshake</stream:error>
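The SHA1_HEX(ID+SECRET) calculation is short enough to show in full. The following Python sketch uses the standard hashlib module; the function names are invented for this example, and the id='1' attribute simply mirrors the <handshake/> element shown above.

```python
import hashlib


def handshake_digest(stream_id, secret):
    """SHA1_HEX(ID+SECRET): hex SHA-1 of the stream ID followed by the secret."""
    return hashlib.sha1((stream_id + secret).encode("utf-8")).hexdigest()


def handshake_element(stream_id, secret):
    """Wrap the digest in the <handshake/> element the component sends."""
    return f"<handshake id='1'>{handshake_digest(stream_id, secret)}</handshake>"


# ID from the server's stream header, secret from the <secret/> tag.
print(handshake_element("3B8E3540", "secret"))
```

The order matters: the ID comes first and the secret second. Since jabberd repeats exactly this calculation with its own copy of the secret, the secret itself never travels over the connection.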

Who gets punted what?

So, the definitions of the RSS sources are held within the script. But there's no reference to who might want to receive new items from which sources. We need a way for our component to accept requests, from users, that say things like:

"I'd like to have pointers to new items from the Slashdot site punted to me please"

or

"I'd also like pointers to new items from Jon Udell's site please"

or even

"Whoa, information overflow! Stop all my feeds!".

Let's take a leaf out of other components' books. There's a common theme that binds together components such as the JUD (Jabber User Directory), and the Transports to other IM systems such as Yahoo! and ICQ. This theme is registration. We've seen this before in the form of user registration, described in the section called User Registration in Chapter 6. This is the process of creating a new account with the JSM. Registration with a service such as the JUD, or an IM transport, however, follows a similar process. And both types of registration have one thing in common:

jabber:iq:register

The jabber:iq:register namespace is what's used in all cases to qualify the exchange of information during a registration process. [6] The section called jabber:iq:register in Chapter 5a describes the jabber:iq:register namespace. It shows us how a typical conversation between requester and responder takes place:

  1. The client sends an IQ-get: "How do I register?"

  2. The component sends an IQ-result: "Here's how. Follow these instructions to fill in these fields."

  3. The client then sends an IQ-set with values in the fields: "Ok, here's my registration request."

  4. To which the component responds, with another IQ-result: "Looks fine. Your registration details have been stored."

It's clear that this sort of model will lend itself well to the process of allowing users to make requests to receive pointers to new items from RSS sources chosen from a list. Example 8-12 shows this conversational model in Jabber XML. There are many fields that can be used in a registration request; the description in the section called jabber:iq:register in Chapter 5a includes a few of these— <name/>, <first/>, <last/>, and <email/>—but there are more. We'll take the <text/> field to accept the name of an RSS source when a user wishes to register his interest to receive pointers to new items from that source. The conversational model is shown from the component's perspective.

Example 8-12. A registration conversation for RSS sources

"How do I register?"

RECV: <iq type='get' id='JCOM_3' to='rss.qmacro.dyndns.org'
        from='dj@qmacro.dyndns.org/basement'>
        <query xmlns='jabber:iq:register'/>
      </iq>

"Here's how:"

SEND: <iq id='JCOM_3' type='result' to='dj@qmacro.dyndns.org/basement'
        from='rss.qmacro.dyndns.org'>
        <query xmlns='jabber:iq:register'>
          <instructions>
            Choose an RSS source from: Slashdot, JonUdell[, ...]
          </instructions>
          <text/>
        </query>
      </iq>

"Ok, here's my registration request:"

RECV: <iq type='set' id='JCOM_5' to='rss.qmacro.dyndns.org'
        from='dj@qmacro.dyndns.org/basement'>
        <query xmlns='jabber:iq:register'>
          <text>Slashdot</text>
        </query>
      </iq>

"Looks fine. Thanks."

SEND: <iq id='JCOM_5' type='result' to='dj@qmacro.dyndns.org/basement'
        from='rss.qmacro.dyndns.org'>
        <query xmlns='jabber:iq:register'>
          <text>Slashdot</text>
        </query>
      </iq>

(Time passes...)

"Whoa. I want out!"

RECV: <iq id='JCOM_11' to='rss.qmacro.dyndns.org' type='set'
        from='dj@qmacro.dyndns.org/basement'>
        <query xmlns='jabber:iq:register'>
          <remove/>
        </query>
      </iq>

"Ok, you're out."

SEND: <iq id='JCOM_11' to='dj@qmacro.dyndns.org/basement' type='result'
        from='rss.qmacro.dyndns.org'>
        <query xmlns='jabber:iq:register'>
          <remove/>
        </query>
      </iq>

We'll use a lightweight persistent storage system for our user/source registrations—DBM—to keep the script fairly simple.
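The component's side of the conversation in Example 8-12 can be sketched as a single handler. This is an illustrative Python outline, not the recipe's script, and it rests on several assumptions: a plain dictionary stands in for the DBM file, registrations are keyed on the bare JID (without the resource), the source name is stored as given without checking it against the list, and the names SOURCES and handle_register are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = "{jabber:iq:register}"
SOURCES = ("Slashdot", "JonUdell")  # hypothetical list of RSS source names


def handle_register(iq_xml, registrations):
    """Answer one jabber:iq:register request and update the registrations."""
    iq = ET.fromstring(iq_xml)
    query = iq.find(NS + "query")
    if query is None:
        return None
    user = iq.get("from").split("/", 1)[0]  # assumption: key on the bare JID

    # The reply swaps to and from; the component stamps its own from.
    reply = ET.Element("iq", {"id": iq.get("id", ""), "type": "result",
                              "to": iq.get("from"), "from": iq.get("to")})
    out = ET.SubElement(reply, "query", {"xmlns": "jabber:iq:register"})

    if iq.get("type") == "get":
        text = "Choose an RSS source from: " + ", ".join(SOURCES)
        ET.SubElement(out, "instructions").text = text
        ET.SubElement(out, "text")
    elif iq.get("type") == "set" and query.find(NS + "remove") is not None:
        registrations.pop(user, None)
        ET.SubElement(out, "remove")
    elif iq.get("type") == "set":
        source = (query.findtext(NS + "text") or "").strip()
        registrations.setdefault(user, set()).add(source)
        ET.SubElement(out, "text").text = source
    else:
        return None
    return ET.tostring(reply, encoding="unicode")


registrations = {}
request = ("<iq type='set' id='JCOM_5' to='rss.qmacro.dyndns.org' "
           "from='dj@qmacro.dyndns.org/basement'>"
           "<query xmlns='jabber:iq:register'><text>Slashdot</text></query></iq>")
print(handle_register(request, registrations))
print(registrations)
```

Keeping one set of sources per user lets the same person register for several sources, while a single <remove/> stops all of their feeds at once.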

One more thing, before we leave this registration section. How will the users know that they can register? What's even more critical: how will they know that the RSS punter actually exists? The section called Browsable Service Information in Chapter 4 explains the way that services can be described, announced even, by the JSM. Most clients, having connected to the server and established a session with the JSM, make a request for a list of agents (old terminology) or services (new terminology) that are available on that Jabber server, like this:

SEND: <iq id="wjAgents" to="qmacro.dyndns.org" type="get">
        <query xmlns="jabber:iq:agents"/>
      </iq>

The response to the request, which looks like this:

RECV: <iq id='wjAgents' to='dj@qmacro.dyndns.org/basement'
        type='result' from='qmacro.dyndns.org'>
        <query xmlns='jabber:iq:agents'>
          <agent jid='conf.qmacro.dyndns.org'>
            <name>Public Chatrooms</name>
            <service>public</service>
            <groupchat/>
          </agent>
          <agent jid='users.jabber.org'>
            <name>Jabber User Directory</name>
            <service>jud</service>
            <search/>
            <register/>
          </agent>
        </query>
      </iq>

reflects the contents of the <browse/> section in the JSM configuration as shown in Example 8-13.

Example 8-13. The JSM configuration's <browse/> section

<browse>
  <conference type="public" jid="conf.qmacro.dyndns.org" 
      name="Public Chatrooms"/>
  <service type="jud" jid="users.jabber.org" name="Jabber User Directory">
    <ns>jabber:iq:search</ns>
    <ns>jabber:iq:register</ns>
  </service>
</browse>

If we add a stanza to the <browse/> section of the JSM configuration that describes our component, like this:

<service type="rss" jid="rss.qmacro.dyndns.org" name="RSS punter">
  <ns>jabber:iq:register</ns>
</service>

then we'll end up with an extra section in the jabber:iq:agents response from the server:

<agent jid='rss.qmacro.dyndns.org'>
  <name>RSS punter</name>
  <service>rss</service>
  <register/>
</agent>

The client-side effect of the agents response is exactly what we're looking for. Figure 8-6 shows WinJab's Agents menu displaying a summary of what it received in response to its jabber:iq:agents query. We can see that the stanza for our RSS punter was present in the <browse/> section and the component is faithfully displayed in the agent list, along with "Public Chatrooms" and "Jabber User Directory". In the main window of the screenshot we can see the "Supported Namespaces" list; it contains the namespace that we specified in our stanza. By specifying

<ns>jabber:iq:register</ns>

we're effectively telling the client that the component will support a registration conversation.
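From the client's point of view, spotting which agents support registration is just a matter of looking for the <register/> flag in the jabber:iq:agents result. The Python sketch below shows one way a client might do that; the sample reply is abridged from the responses in this section, and the function name registrable_agents is made up for the illustration.

```python
import xml.etree.ElementTree as ET

NS = "{jabber:iq:agents}"

# Abridged jabber:iq:agents result, including the RSS punter's entry.
AGENTS_REPLY = """<iq id='wjAgents' type='result' from='qmacro.dyndns.org'>
  <query xmlns='jabber:iq:agents'>
    <agent jid='rss.qmacro.dyndns.org'>
      <name>RSS punter</name>
      <service>rss</service>
      <register/>
    </agent>
    <agent jid='conf.qmacro.dyndns.org'>
      <name>Public Chatrooms</name>
      <service>public</service>
      <groupchat/>
    </agent>
  </query>
</iq>"""


def registrable_agents(iq_xml):
    """Return (jid, name) for every agent that carries a <register/> flag."""
    found = []
    for agent in ET.fromstring(iq_xml).iter(NS + "agent"):
        if agent.find(NS + "register") is not None:
            found.append((agent.get("jid"), agent.findtext(NS + "name")))
    return found


print(registrable_agents(AGENTS_REPLY))
```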

Figure 8-6. WinJab's "Agents" menu

But that's not all! We've advertised our RSS punter in the <browse/> section of the configuration for the JSM on the Jabber server running on qmacro.dyndns.org. That's why we got the information about the RSS punter agent when we connected as user dj to qmacro.dyndns.org—see the window's title bar in Figure 8-6. You may have noticed something odd about the definition of the other two agents, or services, in the <browse/> section earlier, or in the corresponding jabber:iq:agents IQ response. Let's take a look at this response again, this time with the extra detail about our component:

RECV: <iq id='wjAgents' to='dj@qmacro.dyndns.org/basement'
        type='result' from='qmacro.dyndns.org'>
        <query xmlns='jabber:iq:agents'>
          <agent jid='rss.qmacro.dyndns.org'>
            <name>RSS punter</name>
            <service>rss</service>
            <register/>
          </agent>
          <agent jid='conf.qmacro.dyndns.org'>
            <name>Public Chatrooms</name>
            <service>public</service>
            <groupchat/>
          </agent>
          <agent jid='users.jabber.org'>
            <name>Jabber User Directory</name>
            <service>jud</service>
            <search/>
            <register/>
          </agent>
        </query>
      </iq>

Imposter alert! While the jid attribute values for the RSS punter and Public Chatrooms agents show that they are components that are connected to the Jabber server we've just authenticated with (i.e. they both have JIDs in the qmacro.dyndns.org "space", and so are connected to the Jabber server running at qmacro.dyndns.org), the jid attribute for the Jabber User Directory points to a name in the jabber.org "space"! This is actually perfectly ok, and indeed is a side-effect of the power and foresight of Jabber's architectural design. If we connect a component, whether it's one we've built ourselves or one we've downloaded from the http://download.jabber.org site, we can give it an internal or an external identity when we describe it in the jabber.xml configuration.

Example 8-8 and Example 8-9 show two examples of an instance definition for our RSS punter component. Both specify potentially external identities. What this means is that if the hostname rss.qmacro.dyndns.org is a valid and resolvable hostname, the component can be reached from anywhere, not just from within the Jabber server to which it is connected. If instead the component had a name that the outside world couldn't resolve, such as the simple name rss, it could only be reached from the Jabber system to which it was connected.

So let's say rss.qmacro.dyndns.org is a valid and resolvable hostname. [7] If your client is connected to a Jabber server running on, say, yourserver.org, this is what would happen if you were to send, say, a registration request—an <iq/> element with a query qualified by the jabber:iq:register namespace—addressed to rss.qmacro.dyndns.org:

Packet reaches JSM on yourserver.org

You send the IQ from your client, which is connected to your Jabber server's JSM. So this is where the packet first arrives.

Internal routing tables consulted

yourserver.org's jabberd looks in its list of internally registered destinations, and doesn't find rss.qmacro.dyndns.org in there.

Name resolved and routing established

yourserver.org's dnsrv (Hostname Resolution) service is used to resolve rss.qmacro.dyndns.org's address. Then, according to dnsrv's instance configuration (specifically the

<resend>s2s</resend>

part—see the section called Component Instance: dnsrv in Chapter 4), the IQ is then routed on to the s2s (Server to Server) component.

Server to server connection established

yourserver.org establishes a connection to qmacro.dyndns.org via s2s and sends the IQ across the connection.

Packet arrives at RSS punter component on qmacro.dyndns.org

jabberd on qmacro.dyndns.org routes the packet correctly to rss.qmacro.dyndns.org.
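The routing decision in these steps boils down to one question: is the destination hostname registered internally or not? The following Python fragment is a deliberately simplified toy model of that decision, not a description of jabberd's actual code; the function name route and the idea of returning a list of hops are assumptions made purely for illustration.

```python
def route(to_jid, internal_hosts):
    """Return the hops a packet takes, following the steps described above."""
    # Only the hostname part of the address matters for routing.
    host = to_jid.split("/", 1)[0].split("@")[-1]
    if host in internal_hosts:
        return [host]  # registered internally: deliver directly
    # Otherwise dnsrv resolves the name and resends the packet to s2s.
    return ["dnsrv", "s2s", host]


# Sent from a client on yourserver.org: the component is not known locally.
print(route("rss.qmacro.dyndns.org", {"yourserver.org"}))

# Sent from a client on qmacro.dyndns.org: the component is registered.
print(route("rss.qmacro.dyndns.org",
            {"qmacro.dyndns.org", "rss.qmacro.dyndns.org"}))
```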

So, what do we learn from this?

As exemplified by the reference to the JUD running at users.jabber.org, which comes predefined in the standard jabber.xml with version 1.4.1 of the Jabber server, you can specify references to services (that is, components) on other Jabber servers. If you take this RSS punter script (when we finally get to it!), and run it against your own Jabber server, there's no reason why you can't share its services with your friends who run their own Jabber server.

The key is not the reference in the <browse/> section. The key is the resolvability of component names as hostnames, and the ability of Jabber servers to route packets to each other. The stanza in <browse/> just makes it easier for the clients to know about and automatically be able to interact with services in general. Even if a service offered by a public component wasn't described in the result of a jabber:iq:agents query, it wouldn't stop you from reaching it. But you'd have to be good at writing XML by hand in your browser's raw/debug mode ;-)

The version query in Example 8-14 is a good example of this. Regardless of whether the conference component at gnu.mine.nu was listed in the <browse/> section of qmacro.dyndns.org's JSM, the user dj was able to make a version query by specifying the component's address, which was a valid and resolvable hostname, in the IQ-get's to attribute.

Polling the RSS sources

Now a quick word about the polling of the RSS sources. Remembering that the programming model with Jabber is usually event-based, and that we want to poll the RSS sources on a regular basis (although not every second!), we need some way of "interrupting" the process of checking for incoming elements and dispatching them to the callbacks, while we retrieve the RSS data and check for new items. There are many ways of achieving this; we're writing this component in Perl, so we could use the alarm() feature to set an alarm and have a subroutine invoked, to poll the RSS sources, when that alarm went off. This recipe uses the Jabber::Connection library, which negates the need for an external alarm, as we will see when we come to the script.
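One alarm-free way to interleave the two jobs is to give the element-processing step a short timeout and check a timer on every pass through the main loop. The Python sketch below illustrates that general idea only; it does not reproduce the Jabber::Connection API, and the names Beat, process_incoming, and poll_rss_sources are hypothetical.

```python
import time


class Beat:
    """Call `callback` at most once every `interval` seconds."""

    def __init__(self, interval, callback, now=None):
        self.interval = interval
        self.callback = callback
        self.last_run = time.monotonic() if now is None else now

    def tick(self, now=None):
        """Run the callback if the interval has elapsed; report whether it ran."""
        now = time.monotonic() if now is None else now
        if now - self.last_run < self.interval:
            return False
        self.last_run = now
        self.callback()
        return True


# Shape of the main loop (process_incoming is a hypothetical function that
# waits up to one second for an element and dispatches it to its callback):
#
#     beat = Beat(1800, poll_rss_sources)
#     while True:
#         process_incoming(timeout=1)
#         beat.tick()
```

The timeout on the processing step bounds how late a poll can start, so incoming elements are never starved and the poll never needs to interrupt anything.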

Every time it's appropriate for us to poll the RSS sources, this is what we need to do, for each one:

  1. Try to retrieve the source from the URL we have

  2. Attempt to parse the source's XML

  3. Go through the items, until we come across one we've seen before. The ones we go through until then are deemed to be new. (We need a special case the first time around, so that we don't flood everyone with every item of a source the first time we retrieve it.)

  4. For new items, look in our registrations database for the users that have registered for that source, construct a headline message like the one shown in Example 8-7, and send it to those users.

  5. Remember the first of the new items, so that we don't go beyond it next time.
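Steps 3 and 5 can be sketched as one small function. This Python outline makes two assumptions that the steps above leave open: items are identified by their link, and on the very first poll nothing is treated as new (the newest item is simply remembered). The name find_new_items is invented for the example.

```python
def find_new_items(items, last_seen):
    """Return (new_items, marker) for a feed whose items are newest first.

    `last_seen` is the link remembered from the previous poll, or None on
    the first poll of this source.
    """
    if not items:
        return [], last_seen
    newest = items[0]["link"]
    if last_seen is None:
        # Assumed special case: don't flood users on the first retrieval.
        return [], newest
    new_items = []
    for item in items:
        if item["link"] == last_seen:
            break  # everything from here on has been seen before
        new_items.append(item)
    return new_items, newest


feed = [{"link": "c"}, {"link": "b"}, {"link": "a"}]
print(find_new_items(feed, None))
print(find_new_items(feed, "a"))
```

One consequence of this approach is that if the remembered item has already dropped off the end of the feed, every item in it is treated as new.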

Other things to bear in mind

There are differences between programming a component and programming a client. We're already aware of many of the major ones, described in the section called The script as component. There are, however, also more subtle differences that we need to bear in mind.

As we know, components, unlike clients, do not connect to the JSM. They connect as a peer of the JSM. Not only does this mean, as already stated, that they cannot partake of the IM feast of features made available by JSM's modules (see the section called Component Connection Method in Chapter 4 for a list of these modules), but also that they must do more for themselves. [8] When constructing an element as a client, we should not specify a from attribute before we send it; this is added "by the server"—more precisely, by the JSM—as it arrives. This is to prevent JID spoofing. Because a component does not connect through the JSM, no "from-stamping" takes place: the component itself must stamp the element with a from attribute.

The addressing of a component is also slightly different. Whereas client addresses reflect the fact that they're connected to the JSM, always having the form:

[user]@[hostname]/[resource]

(the resource being optional), the basic address form of a component is simply:

[hostname]

This doesn't mean to say that the address of a component cannot have a [user] or a [resource] part. It's just that all elements addressed to:

anything@[hostname]/anything

will be routed by jabberd to the component. This means our component can play multiple roles, and have many personalities. We'll see an example of this in the script, where we construct an "artificial" [user]@[hostname] address for the from attribute of a <message/> element, to convey information.
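To make the addressing rule concrete, here is a small sketch of how an address might be picked apart. It assumes nothing about jabberd's internals: split_jid() and is_for_component() are hypothetical helpers written for this illustration, and only the hostname part decides whether an element is "ours".

```perl
use strict;
use warnings;

my $ID = 'rss.qmacro.dyndns.org';

# Split a JID into its optional user, its hostname, and its optional
# resource. Missing parts come back as undef.
sub split_jid {
  my ($jid) = @_;
  my ($user, $host, $resource) =
    $jid =~ m{^(?:([^\@/]+)\@)?([^\@/]+)(?:/(.*))?$};
  return ($user, $host, $resource);
}

# An element is for the component if the hostname part matches,
# whatever the user and resource parts are.
sub is_for_component {
  my ($jid) = @_;
  my (undef, $host) = split_jid($jid);
  return (defined $host and lc($host) eq lc($ID)) ? 1 : 0;
}

my ($user, $host, $resource) = split_jid('slashdot@rss.qmacro.dyndns.org/anything');
print "$user | $host | $resource\n";
print is_for_component('rss.qmacro.dyndns.org'), "\n";
print is_for_component('dj@qmacro.dyndns.org/basement'), "\n";
```

The user part is then free for the component to interpret as it likes, which is what lets it have "many personalities".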

The component will respond to IQ queries in the jabber:iq:register namespace. It is, in fact, "customary" for components to respond to queries in a set of common IQ namespaces, although by no means mandatory. Taking the JUD and Conferencing components, for example, we see that they both respond to IQ queries in the jabber:iq:time and jabber:iq:version namespaces. Example 8-14 shows a typical version query on a Conferencing component. This responsiveness is simply to provide a basic level of administrative information. We want our component to conform to the customs, so we'll make sure it also responds to queries in these namespaces.

Example 8-14. A Conferencing component responds to a version query

SEND: <iq type='get' to='conf.gnu.mine.nu'>
        <query xmlns='jabber:iq:version'/>
      </iq>

RECV: <iq type='result' to='dj@qmacro.dyndns.org/study' 
        from='conf.gnu.mine.nu'>
        <query xmlns='jabber:iq:version'>
          <name>conference</name>
          <version>0.4</version>
          <os>Linux 2.2.13</os>
        </query>
      </iq>

The script

It's time to let the dog see the rabbit. The component is written in Perl. You might want to refer to the script as a whole unit while reading through this section—you'll find it in the section called The script in its entirety. Ok. Let's go.

Setup

use strict;
use Jabber::Connection;
use Jabber::NodeFactory;
use Jabber::NS qw(:all);
use MLDBM 'DB_File';
use LWP::Simple;
use XML::RSS;

We're going to be using the Jabber::Connection library, so we declare that here; the library actually consists of three modules, and we're going to use them all. Jabber::Connection manages our connection to the server, and parses and dispatches incoming elements. Jabber::NodeFactory allows us to manipulate elements (generically called "nodes" by the module), and Jabber::NS provides us with a raft of constants that reflect namespaces and other common strings used in Jabber server, client and component programming.

We need a way of storing the registration information between invocations of the component script, and we'll use MLDBM for that. MLDBM is a really useful wrapper around the DB_File module. DB_File provides access to Berkeley DB database facilities using the tie() function. While you can't store references (i.e. complex data structures) via DB_File, you can when you use the MLDBM wrapper.

We will use the LWP::Simple module to grab the RSS sources by URL, and the XML::RSS module to parse those sources once retrieved.

my $NAME     = 'RSS Punter';
my $ID       = 'rss.qmacro.dyndns.org';
my $VERSION  = '0.1';
my $reg_file = 'registrations';
my %reg;

my %cache;

my %sources = (

  'jonudell' => 'http://udell.roninhouse.com/udell.rdf',
  'slashdot' => 'http://slashdot.org/slashdot.rdf',

  # etc ...

);

We start by declaring a few variables. We will see later in the script that $NAME, $ID and $VERSION will be used to reflect information in response to IQ queries. The variable $reg_file defines the name of the DB file to which we'll be tie()ing our registration hash %reg. %cache is our RSS item cache, to hold items that we've already seen, so we know when we've come to the end of the new items in a particular source.

We define our RSS sources in %sources. You may wish to define these differently, perhaps outside of the script. There are a couple of examples here; add your own favourite channels to taste.

tie (%reg, 'MLDBM', $reg_file) or die "Cannot tie to $reg_file: $!\n";

This magic line makes any data we store in the %reg hash persistent. It works by binding the operations on the hash (add, delete, and so on) to Berkeley DB operations, using the MLDBM module to stringify (and reconstruct) complex data structures so that they can be stored (and retrieved).

Connection

Right. We're ready to connect to the Jabber server as a component. Despite what's involved (described in the section called The script as component) it's very easy using a library such as Jabber::Connection:

my $c = new Jabber::Connection(
  server    => 'localhost:5999',
  localname => $ID,
  ns        => 'jabber:component:accept',
);

We construct a Jabber::Connection object, specifying the details of the connection we wish to make. The server argument is used to specify the hostname, and optionally the port, of the Jabber server to which we wish to connect. In the case of a component, we must always specify the port (which is 5999 in our case, according to the component instance definition shown in Example 8-8). The same constructor can be used to create a client connection to Jabber, in which case a default port of 5222 (the standard port for client connections) is assumed if none is explicitly specified. The localname argument is used to specify our name (the component's name), which in this case is rss.qmacro.dyndns.org. We specify the stream namespace with the ns argument. In the same way that a default port of 5222 is assumed if none is specified, a default stream namespace of jabber:client is assumed if no ns argument is specified. We wish to connect as a component using the TCP sockets connection method, so we must specify the appropriate namespace: jabber:component:accept.

This constructor call results in a stream header being prepared, one that looks like the one shown in Example 8-10.

The actual connection attempt, including the sending of the component's stream header, is done by calling the connect() method on the connection object in $c:

unless ($c->connect()) { die "oops: ".$c->lastError; }

This will return a true value if the connect succeeded (success is measured by whether the socket connection was established and whether the Jabber server sent a stream header in response). If it didn't succeed, we can retrieve details of what happened using the lastError() method.

We're connected. Before performing the authenticating handshake, we're going to do a bit of preparation:

$SIG{HUP} = $SIG{KILL} = $SIG{TERM} = $SIG{INT} = \&cleanup;

The idea is that the component will be run and only stopped in exceptional circumstances. If it is stopped, we want to clean things up before the script ends. Most importantly, we need to make sure our registration data is safe, but also we want to play nicely with the server and gracefully disconnect. This is done in the cleanup() function.

Preparation of the RSS event function and element handlers

debug("registering RSS beat");
$c->register_beat(1800, \&rss);

Jabber::Connection offers a simple way of having a function execute at regular intervals. It avoids the need for setting and re-setting alarm()s. The register_beat() method takes two arguments. The first represents the interval, in seconds. The second is a reference to the function that should be invoked at each interval. Here, we're saying we want the rss() function called every half an hour.
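To get a feel for what a heartbeat mechanism involves, here is a toy version in plain Perl. It is only a sketch of the general idea and makes no claim about how Jabber::Connection implements it: register_beat_sketch() and tick() are hypothetical names, and the clock is passed in as a parameter so that the example can be followed without waiting half an hour.

```perl
use strict;
use warnings;

my @beats;   # each entry: { interval, callback, next_due }

# Remember a callback and when it is next due, in seconds
sub register_beat_sketch {
  my ($interval, $callback, $now) = @_;
  push @beats, {
    interval => $interval,
    callback => $callback,
    next_due => $now + $interval,
  };
}

# Called regularly from a main loop: run whatever has fallen due
sub tick {
  my ($now) = @_;
  foreach my $beat (@beats) {
    next if $now < $beat->{next_due};
    $beat->{callback}->();
    $beat->{next_due} = $now + $beat->{interval};
  }
}

my $polls = 0;
register_beat_sketch(1800, sub { $polls++ }, 0);

# Simulate the main loop looking at the clock every ten minutes or so
tick($_) for (600, 1200, 1800, 2400, 3600);
print "polled $polls times\n";
```

The point is that the polling function is driven from inside the same loop that processes incoming elements, so no signal-based interruption is needed.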

debug("registering IQ handlers");
$c->register_handler('iq',\&iq_register);
$c->register_handler('iq',\&iq_version);
$c->register_handler('iq',\&iq_browse);
$c->register_handler('iq',\&iq_notimpl);

Most of the traffic relating to our component will be the headline messages emanating from it. However, we are expecting incoming IQ elements, particularly for registration in the jabber:iq:register namespace. We've also already mentioned that it's customary for components to honor basic "administrative" queries such as version checks. So the list of calls to the register_handler() method here reflects what we want to offer in terms of handling these IQ elements.

Whereas with Net::Jabber's SetCallBacks() function, and with JabberPy's setIqHandler() method we specify a single function to act as a handler for incoming <iq/> elements, we can specify as many handlers as we want for each element type with the register_handler() method in Jabber::Connection. The first argument refers to the element name (the name of the element's outermost tag), and the second refers to a function that will be called on receipt of an element of that name. Each of the handlers for a particular element will be called in the order they were registered. So when an <iq/> element is received over the XML stream, Jabber::Connection will dispatch it to iq_register(), then to iq_version(), then to iq_browse(), and then to iq_notimpl(). That is, unless one of those handler functions decides that the element has been handled once and for all, and that the dispatch processing for that element should stop there. In this case, that handler simply returns a special value (defined in Jabber::NS) and the dispatching stops for that element. The handlers can also cooperate, in that the dispatcher will pass whatever one handler returns, into the next handler in the list, and so on, so that you can effectively share data across handler events for a particular element, building up a complex response as you go.

This "contextual response chain" model works in a similar way to how the mod_auth_* authentication modules in JSM work. Each one that wishes to express its interest in authenticating a user adds its "stamp" to the response to an IQ-get in the jabber:iq:auth namespace, before that response is returned to the client. [9]
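The dispatching behaviour described above can be sketched in a few lines of plain Perl. This is an illustration under stated assumptions, not the library's code: dispatch(), the R_HANDLED constant, and the hash used to stand in for an element are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the special "stop here" return value
use constant R_HANDLED => 'r_HANDLED';

my @log;

# Handlers receive the element and whatever the previous handler returned
my @handlers = (
  sub { push @log, 'register'; return },            # not ours, pass it on
  sub {
    my ($element, $previous) = @_;
    push @log, 'version';
    return R_HANDLED if $element->{ns} eq 'jabber:iq:version';
    return;
  },
  sub { push @log, 'notimpl'; return R_HANDLED },   # catch-all
);

# Call each handler in registration order until one says it's handled
sub dispatch {
  my ($element, @chain) = @_;
  my $carry;
  foreach my $handler (@chain) {
    $carry = $handler->($element, $carry);
    last if defined $carry and $carry eq R_HANDLED;
  }
  return $carry;
}

dispatch({ ns => 'jabber:iq:version' }, @handlers);
print join(' -> ', @log), "\n";
```

Here the catch-all handler never sees the version query, because the second handler has stopped the chain.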

Authenticating handshake and launch of main loop

Once we've set up our handlers, we're ready to make the authenticating handshake. This is simply a call to the auth() method:

$c->auth('secret');

It takes one or three arguments, depending on whether the authentication is for a client or a component. Jabber::Connection decides which authentication context is required by looking at the namespace specified (or defaulted) in the connection constructor call. As we specified the namespace jabber:component:accept, the auth() method is expecting a single argument which is the secret specified in the <secret/> tag of the component instance definition. auth() performs the message digest function and sends the <handshake/> element.
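For the curious, the digest in a jabber:component:accept handshake is the hexadecimal SHA-1 hash of the stream ID (from the server's stream header) concatenated with the secret. The sketch below shows that calculation with the core Digest::SHA module; the stream ID value is made up for illustration, and in practice auth() does all of this for us.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical stream id, as it would appear in the id attribute of
# the server's stream header
my $stream_id = '3B2DB1A7';
my $secret    = 'secret';

# Hex-encoded SHA-1 of the stream id followed by the secret
my $digest    = sha1_hex($stream_id . $secret);
my $handshake = "<handshake>$digest</handshake>";

print "$handshake\n";
```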

It's now appropriate for us to "launch" the component, with the start() method:

$c->start;

This is the equivalent of the MainLoop() method in Perl's Tk library, and is a method from which there's no exit. Calling start() causes the connection object to perform an endless loop, which internally calls a process() method on a regular basis, receiving, examining and dispatching elements received on the XML stream. It also starts and maintains the heartbeat, to which the register_beat() method is related. [10]

Handling registration requests

The first of the handlers defined for <iq/> elements is the iq_register() function. We put it first in the list as we consider receipt of <iq/> elements in the jabber:iq:register namespace to be the most common. We want this function to deal with the complete registration conversation. This means it must respond to IQ-get and IQ-set type requests.

sub iq_register {

  my $node = shift;

  debug("[iq_register]");

The primary piece of data that the dispatcher passes to a callback is the element to be handled. We receive this into the $node variable; it's a Jabber::NodeFactory::Node object. [11] The first thing we should do is make sure it's appropriate to continue inside this function, which is only designed to handle jabber:iq:register qualified queries. The namespace jabber:iq:register is represented with the constant NS_REGISTER, imported from the Jabber::NS module.

  return unless my $query = $node->getTag('', NS_REGISTER);
  debug("--> registration request");

The getTag() method takes up to two arguments. The first can be used to specify the name of the tag you want to get. A namespace can be specified in the second argument to narrow down the request; if an element were to contain two child tags with the same name, for example, the two <x/> elements in this <message/> element here:

<message to='dj@qmacro.dyndns.org' from='piers@jabber.org' id='2941'>
  <body>Let me know when you're ready to go</body>
  <x xmlns='jabber:x:event'><displayed/></x>
  <x xmlns='jabber:x:delay' 
     from='dj@qmacro.dyndns.org'
     stamp='20010831T08:58:30'>Offline Storage</x>
</message>

We could distinguish one from the other by specifying either the jabber:x:event or the jabber:x:delay namespace.

Although normally the query tag within an <iq/> element has the name "query", we see from the section called IQ Subelements in Chapter 5 that it could be anything. So:

$node->getTag('', NS_REGISTER)

says "get a single child tag of our <iq/> node; doesn't matter what the name of the tag is, what's important is that it's qualified by the jabber:iq:register namespace."

If we call the getTag() function in scalar context, and there is more than one tag that matches, only the first one found will be returned. If we call it in list context, all the matching tags are returned. Assuming the call is successful, the variable $query then contains the <query/> tag and all its subtags. So if we received this in $node:

RECV: <iq type='set' id='JCOM_5' to='rss.qmacro.dyndns.org'
        from='dj@qmacro.dyndns.org/basement'>
        <query xmlns='jabber:iq:register'>
          <text>Slashdot</text>
        </query>
      </iq>

then $query would contain a Jabber::NodeFactory::Node object that represented this bit:

<query xmlns='jabber:iq:register'>
  <text>Slashdot</text>
</query>
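The name-and-namespace matching, and the difference between scalar and list context, can be mimicked with a few lines of plain Perl. This is a toy model only: get_tag() and the hash-based "node" are hypothetical stand-ins for illustration, not the Jabber::NodeFactory implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the <message/> shown earlier: a hash with
# a list of child tags, each having a name and a namespace
my $message = {
  name     => 'message',
  children => [
    { name => 'body', ns => '' },
    { name => 'x',    ns => 'jabber:x:event' },
    { name => 'x',    ns => 'jabber:x:delay' },
  ],
};

# An empty name matches any tag; an undefined namespace matches any
# namespace. List context returns all matches, scalar context the first.
sub get_tag {
  my ($node, $name, $ns) = @_;
  my @found = grep {
    (!defined $name or $name eq '' or $_->{name} eq $name)
      and (!defined $ns or $_->{ns} eq $ns)
  } @{ $node->{children} };
  return wantarray ? @found : $found[0];
}

my $delay = get_tag($message, 'x', 'jabber:x:delay');   # scalar context
my @all_x = get_tag($message, 'x');                      # list context
my $event = get_tag($message, '', 'jabber:x:event');     # any name

print scalar(@all_x), " x tags; first match by namespace: $delay->{ns}\n";
```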

If no child tag qualified by the jabber:iq:register namespace can be found, iq_register() returns, and the dispatcher calls the next handler in line—iq_version(). However, let's assume that we do have a registration IQ on our hands.

The function must handle both IQ-gets and IQ-sets. We first deal with a potential IQ-get:

  # Reg query
  if ($node->attr('type') eq IQ_GET) {
    $node = toFrom($node);
    $node->attr('type', IQ_RESULT);
    my $instructions = "Choose an RSS source from: ".join(", ", keys %sources);
    $query->insertTag('instructions')->data($instructions);
    $query->insertTag('text');
    $c->send($node);
  }

The attr() method called on a node will return the value of the node's attribute of the name specified as the first argument. We test to see if the <iq/>'s type attribute is "get" (IQ_GET). If it is, we need to return an IQ-result as shown in Example 8-12.

Rather than create a new element from scratch to return in response, we simply "convert" the incoming element by making the necessary changes to it, turn it around and send it back out as our response. So the first thing we do is swap around the values for the from and to attributes in the <iq/> tag (in $node) by calling the toFrom() function (see the section called Helper functions), and set the value for the type attribute to "result" by calling a 2-argument version of the attr() function, turning this:

<iq type='get' id='JCOM_5' to='rss.qmacro.dyndns.org'
  from='dj@qmacro.dyndns.org/basement'>

into this:

<iq type='result' id='JCOM_5' from='rss.qmacro.dyndns.org'
  to='dj@qmacro.dyndns.org/basement'>

Notice that we retain the from attribute; this is required as we're a component, and our response won't get stamped with one.

We must pass the instructions and an empty <text/> tag back in our response. We combine the names of the sources into a list, and insert an <instructions/> tag into our query node (in $query) containing the text. This is done with two method calls; the first to insertTag(), which returns a Jabber::NodeFactory::Node object that represents the newly inserted tag, and the second to data() which inserts (or retrieves) data into (or out of) a node. The line:

    $query->insertTag('instructions')->data($instructions);

could have been written as:

    my $tag = $query->insertTag('instructions');
    $tag->data($instructions);

The response, once constructed, and which now looks like this:

<iq type='result' id='JCOM_5' from='rss.qmacro.dyndns.org'
  to='dj@qmacro.dyndns.org/basement'>
  <query xmlns='jabber:iq:register'>
    <instructions>
      Choose an RSS source from: jonudell, slashdot [...]
    </instructions>
    <text/>
  </query>
</iq>

is sent using the send() method of the connection object.

If the query wasn't an IQ-get, then it might be an IQ-set:

  # Reg request
  if ($node->attr('type') eq IQ_SET) {

    # Strip JID to user@host
    my $jid = stripJID($node->attr('from'));

    $node = toFrom($node);
    my $source;

In this case, the user is requesting to receive new items for an RSS source he's specified in the <text/> field carried in the query part of the IQ-set. The user's JID can be found in the from attribute of the element, which we extract with the attr() method. But there's one thing we should do before using that JID as a key in storing that user's RSS source preferences. Look at what the JID was in the examples earlier:

dj@qmacro.dyndns.org/basement

It's a full-blown [user]@[hostname]/[resource] style JID. That's fine for returning a response to an IQ request, but we need something less specific, something less "of the moment". The resource part of the JID reflects the client connection of the user at the time of the registration request. In the future, when we have an RSS item to punt to him, he might be connected with a different resource. We want the RSS item to go to the right place, so we use the more generic form of the JID, [user]@[hostname], to store preferences and subsequently address our headline messages. We obtain the more generic form of the JID by calling the stripJID() function, described later.

After swapping the from and to values as before, we deal with the two different types of IQ-set requests—a request to receive a specific source, or a request to cancel registration (i.e. "unregistration"):

    # Could be an unregister
    if ($query->getTag('remove')) {
      delete $reg{$jid};
      $node->attr('type', IQ_RESULT);
    }
    # Otherwise it's a registration for a source
    elsif ($source = $query->getTag('text')->data 
           and exists($sources{$source})) {
      my $element = $reg{$jid};
      $element->{$source} = 1;
      $reg{$jid} = $element;
      $node->attr('type', IQ_RESULT);
    }
 

Sending a <remove/> tag in an IQ-set registration context represents a request to unregister. So we honour that by removing all trace of the user's JID from our registration hash, and simply changing the type of the <iq/> element to "result".

Otherwise, we interpret the IQ-set as a request to 'subscribe' to the RSS source that they've specified in the <text/> tag. We extract that source's name into $source, check that it's valid, and add a reference to the user's list of sources in the registration hash %reg. Example 8-15 shows what the registration hash looks like.

Example 8-15. Typical contents of the registration hash

(
  'dj@qmacro.dyndns.org' => {
                              'slashdot' => 1,
                            },
  'piers@jabber.org'     => {
                              'jonudell' => 1,
                              'slashdot' => 1,
                            },
  ...
) 

In case you're wondering about the little value "dance" that's going on with the $element variable, it's because of a current restriction with MLDBM. Although it allows us to store complex structures via DB_File, we can't manipulate those structures directly, and have to do it via a "proxy" variable—$element.
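The reason for the "dance" becomes clearer with a small experiment. The sketch below does not use MLDBM itself; instead it uses a hypothetical stand-in, the FrozenHash package, which serializes values with the core Storable module whenever they are stored, and rebuilds them whenever they are fetched. That is enough to reproduce the effect: each fetch hands back a fresh copy, so changes made directly to a nested structure never reach the stored version.

```perl
use strict;
use warnings;
use Storable ();

# Hypothetical stand-in for MLDBM: values are serialized on STORE and
# rebuilt on FETCH, so every FETCH returns a brand new copy
package FrozenHash;
sub TIEHASH { return bless {}, shift }
sub STORE   { my ($self, $key, $value) = @_; $self->{$key} = Storable::freeze($value) }
sub FETCH   { my ($self, $key) = @_;
              return exists $self->{$key} ? Storable::thaw($self->{$key}) : undef }
sub EXISTS  { my ($self, $key) = @_; return exists $self->{$key} }
sub DELETE  { my ($self, $key) = @_; return delete $self->{$key} }

package main;

tie my %reg, 'FrozenHash';
my $jid = 'dj@qmacro.dyndns.org';
$reg{$jid} = { slashdot => 1 };

# Direct manipulation only changes a temporary copy, and is lost
$reg{$jid}{jonudell} = 1;
my $direct = exists $reg{$jid}{jonudell} ? 'kept' : 'lost';

# Fetch into a proxy variable, modify it, and store it back
my $element = $reg{$jid};
$element->{jonudell} = 1;
$reg{$jid} = $element;
my $proxied = exists $reg{$jid}{jonudell} ? 'kept' : 'lost';

print "direct: $direct, via proxy: $proxied\n";
```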

Once this is done, we also mark the fact that the request was completed by setting the IQ element's type attribute to "result".

Finally, it's worth telling the requester that anything else sent just isn't cricket:

    else {
      $node->attr('type', IQ_ERROR);
      my $error = $node->insertTag('error');
      $error->attr('code', '405');
      $error->data('Not Allowed');
    }

That is, if we haven't understood what the IQ-set was—it wasn't a <remove/> request, nor was it a subscription to a source we recognise—we simply return it with an <error/> tag like this:

RECV: <iq type="set" id="jimAgentID657" to="rss.qmacro.dyndns.org">
        <query xmlns="jabber:iq:register">
          <text>banana</text>
        </query>
      </iq>

SEND: <iq id='jimAgentID657' type='error' from='rss.qmacro.dyndns.org'
          to='dj@qmacro.dyndns.org/basement'>
        <query xmlns='jabber:iq:register'>
          <text>banana</text>
        </query>
        <error code='405'>Not Allowed</error>
      </iq>

The <iq/> type is set to "error" to draw the client's attention to the <error/> tag. Sending an element back in error is a great example of where reusing an incoming element to build the outgoing response works very well; we don't have any work to reproduce what was in error, as it's already contained in what we're turning around.

In all of the IQ-set cases, we want to send something back, so we do this now:

    $c->send($node);

  }

  return r_HANDLED;

}

Note that we also return a special value r_HANDLED. The fact that we've got this far means that we received an IQ element, it was a registration-related element, and we've handled it, so there's no point in letting the other callbacks registered to handle IQ elements get a look in. So we tell the dispatcher to stop the invocation chain for the element we've just processed.

Handling version requests

Now we've seen the iq_register() function, the function to handle jabber:iq:version queries looks pretty straightforward:

sub iq_version {

  my $node = shift;
  debug("[iq_version]");

  # Only handle IQ-gets carrying a jabber:iq:version qualified child
  return unless my $query = $node->getTag('', NS_VERSION) 
            and $node->attr('type') eq IQ_GET;

  debug("--> version request");

  $node = toFrom($node);
  $node->attr('type', IQ_RESULT);
  $query->insertTag('name')->data($NAME);
  $query->insertTag('version')->data($VERSION);
  $query->insertTag('os')->data(`uname -sr`);
  $c->send($node);

  return r_HANDLED;

}

As we check for whether the element is appropriate to handle in iq_register(), so we do here, this time looking for an IQ-get with a query child tag qualified by the NS_VERSION (jabber:iq:version) namespace, which we snag into $query.

Setting the <iq/>'s type to "result" and flipping the addresses, we then just have to add <name/>, <version/>, and <os/> tags to the query child with appropriate values, to end up with a response like the one shown in Example 8-14.

If we've done this, we deem the IQ to have been handled, and return the special r_HANDLED value to stop the dispatching going any further for this element.

Handling browse requests

Next in line to handle the incoming <iq/> element is the iq_browse() function. Of course, if we've already handled the element, iq_browse() won't even get a shot at responding. But if it did, it would proceed along similar lines to the iq_version() function:

sub iq_browse {

  my $node = shift;
  debug("[iq_browse]");

  # Only handle IQ-gets carrying a jabber:iq:browse qualified child
  return unless my $query = $node->getTag('', NS_BROWSE)
            and $node->attr('type') eq IQ_GET;

  debug("--> browse request");

  $node = toFrom($node);
  $node->attr('type', IQ_RESULT);
  my $rss = $query->insertTag('service');
  $rss->attr('type', 'rss');
  $rss->attr('jid', $ID);
  $rss->attr('name', $NAME);
  $rss->insertTag('ns')->data(NS_REGISTER);
  $c->send($node);

  return r_HANDLED;

}

The only real difference is the fact that we want this function to handle IQ-gets in the jabber:iq:browse namespace, and return a browse result. We'll be looking at browsing in more detail in the section called Browsing LDAP in Chapter 9. For now, we'll content ourselves with returning a top-level browse result that reflects what might be returned if a similar browse request were made of the JSM, as described in the section called jabber:iq:browse in Chapter 5a. What iq_browse() will return is shown in Example 8-16.

Example 8-16. RSS punter responds to jabber:iq:browse requests via iq_browse()

RECV: <iq type="get" id="browser_JCOM_2" to="rss.qmacro.dyndns.org">
        <query xmlns="jabber:iq:browse"/>
      </iq>

SEND: <iq id='browser_JCOM_2' type='result'
        to='dj@qmacro.dyndns.org/winjab'
        from='rss.qmacro.dyndns.org'>
        <query xmlns='jabber:iq:browse'>
          <service jid='rss.qmacro.dyndns.org'
               type='rss'
               name='RSS Punter'>
            <ns>jabber:iq:register</ns>
          </service>
        </query>
      </iq>

Other requests

Any other requests? If you hum it, I'll play it. But seriously, there are untold IQ elements that could be sent to the component. While it would be possible just to ignore them, we ought to do the done thing and at least respond with a "not supported" type of response. So we have iq_notimpl() as a catch-all. If the dispatcher manages to make its way through to here, we know that the <iq/> element is not anything we recognise as wanting to respond to.

So let's just tell the requester that what they're asking for is not implemented:

sub iq_notimpl {

  my $node = shift;
  $node = toFrom($node);
  $node->attr('type', IQ_ERROR);
  my $error = $node->insertTag('error');
  $error->attr('code', '501');
  $error->data('Not Implemented');
  $c->send($node);

  return r_HANDLED;

}

As you can see, all this does is switch the from and to, set the <iq/> type to "error", add an <error/> tag that looks like this:

<error code='501'>Not Implemented</error>

and throw the modified element back to the requester.

The RSS mechanism

Now we've set up the functions to handle the incoming queries, all that's left is for us to define what happens every time the heartbeat in the Jabber::Connection loop ticks past the 30 minute mark. We registered this rss() function with the register_beat() method earlier in the script:

sub rss {

  debug("[rss]");

  # Create NodeFactory
  my $nf = new Jabber::NodeFactory;

While in the IQ handlers we turned the incoming request elements around into responses, we'll actually be building elements, headline messages to be precise, from scratch here. This is why we need an instance of the Jabber::NodeFactory.

  # Go through each of the RSS sources
  foreach my $source (keys %sources) {

    # Retrieve attempt
    my $data = get($sources{$source});

    # Didn't get it? Next one
    unless (defined($data)) {
      debug("cannot retrieve $source");
      next;
    }

    # Parse the RSS
    my $rss = XML::RSS->new();
    eval { $rss->parse($data) };

    if ($@) {
      debug("problems parsing $source");
      next;
    }

The procedure in this function reflects what we described in the section called Polling the RSS sources earlier. Each time rss() is called, it goes through each of the sources defined in the list (%sources), and tries to retrieve it, with get(), a function from the LWP::Simple library, and parse it, with an instance of XML::RSS. [12] Because XML::RSS uses XML::Parser, which die()s if it encounters invalid XML, we wrap the call to the parse() method in eval.

    my @items = @{$rss->{items}};

    # Check new items
    debug("$source: looking for new items");
    foreach my $item (@items) {

      # Stop checking if we get to items already seen
      last if exists $cache{$source} and $cache{$source} eq $item->{link};

      debug("$source: new item $item->{title}");

Pulling the items from the RSS source into @items, we look through them, but stop looking if we come across one that we've seen previously (and stored in the %cache).

If we do have a new item to send out, we create a headline message containing the item's details:

      # Create a headline message
      my $msg = $nf->newNode('message');

      $msg->attr('type', 'headline');
      $msg->attr('from', join('@', $source, $ID));
      $msg->insertTag('subject')->data($item->{title});
      $msg->insertTag('body')->data($item->{description});

      my $xoob = $msg->insertTag('x', NS_XOOB);
      $xoob->insertTag('url')->data($item->{link});
      $xoob->insertTag('desc')->data($item->{description});

We use our nodefactory in $nf to create a new empty <message/> element, with the newNode() method. We build up this element into a full-blown headline message with a jabber:x:oob qualified <x/> extension containing the RSS item information. Note that the call to insertTag() here has two arguments. The second is used to specify an optional namespace with which the new node (or tag) will be qualified. What this call creates, in $xoob, is a Jabber::NodeFactory::Node object that looks like this:

<x xmlns='jabber:x:oob'/>

This is then embellished with the usual <url/> and <desc/> tags. What's still missing is the address information. We've specified the from, but in a departure from the values used for the from in our IQ responses, here we specify something slightly different with join('@', $source, $ID): a [user]@[hostname] style address. For the slashdot source, this would be:

slashdot@rss.qmacro.dyndns.org

This is mostly because it conveys more information than just the component name rss.qmacro.dyndns.org would do. While the component's address would not normally be seen by a client user in the context of the IQ responses, many Jabber clients that support the headline message type show the message sender in the headline list display. You can see this in Figure 8-7, where the From column in the headline list shows clearly the RSS source where the item originated. This also builds a little future-proofing into the component. If we wanted to extend the component for more interaction with the clients, we could have the client send a message to the [RSS source]@[componentname] JID and on receipt, the component would immediately have context information on which source the message was about, without the client user having to do anything other than specify a JID.

Figure 8-7. Jarl's headline display window

Now we've built our headline message, which looks like the one in Example 8-7, we can fire it off to each of the users who have registered for that RSS source:

      # Deliver to all that want it
      foreach my $jid (keys %reg) {

        my $registration = $reg{$jid};

        if (exists($registration->{$source})) {
          $msg->attr('to', $jid);
          debug("punting to $jid");
          $c->send($msg);
        }

      }

The first time we encounter an RSS source, we won't have any record of a "last seen" item in the %cache. So we avoid flooding people with all the items of a new RSS source by jumping out of the item loop if there's no cache info:

      # Prevent all items counted as new the
      # first time around
      last unless exists($cache{$source});

    }

Finally, we make a mark in the cache for the "latest" item we've just encountered, ready for next time:

    # Remember the latest new item
    $cache{$source} = $items[0]->{link};

  }

}

The cleanup() function

The cleanup() function is called if an attempt is made to shut the script down; it untie()s the registration hash, ensuring no data is lost, and disconnects from the Jabber server:

sub cleanup {

  debug("cleaning up");
  untie %reg;
  $c->disconnect;
  exit;

}

Helper functions

Any script over a certain small size is bound to have helper functions; our RSS punter is no exception. Here we have the function to switch the from and to attribute values of a node (toFrom()), the function to remove the resource from a JID (stripJID()), and something not much better than a debugging-style print statement :-)

sub toFrom {
  my $node = shift;
  my $to = $node->attr('to');
  $node->attr('to', $node->attr('from'));
  $node->attr('from', $to);
  return $node;
}


sub stripJID {

  my $JID = shift;
  $JID =~ s|/.*$||;
  return $JID;

}


sub debug {

  print STDERR "debug: ", @_, "\n";

}
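To see the two main helpers at work without a Jabber connection, here is a small sketch. DemoNode is a hypothetical stand-in written just for this example; it supports only the attr() get and set calls that toFrom() relies on, and is not the real node class. The client JID is made up.

```perl
use strict;
use warnings;

# Minimal stand-in for a node object, for illustration only: it
# supports just the attr() get/set calls that toFrom() relies on
package DemoNode;
sub new  { my ($class, %attrs) = @_; return bless { %attrs }, $class }
sub attr {
  my ($self, $name, $value) = @_;
  $self->{$name} = $value if @_ > 2;
  return $self->{$name};
}

package main;

# The two helpers, as in the script
sub toFrom {
  my $node = shift;
  my $to = $node->attr('to');
  $node->attr('to', $node->attr('from'));
  $node->attr('from', $to);
  return $node;
}

sub stripJID {
  my $JID = shift;
  $JID =~ s|/.*$||;
  return $JID;
}

my $iq = DemoNode->new(
  from => 'dj@qmacro.dyndns.org/jarl',       # made-up client JID
  to   => 'rss.qmacro.dyndns.org',
);

toFrom($iq);
print "to:   ", $iq->attr('to'), "\n";
print "from: ", $iq->attr('from'), "\n";
print "bare: ", stripJID($iq->attr('to')), "\n";
```

After toFrom() the reply is addressed back to the full client JID, resource included, while stripJID() gives the user@host form that the script uses as the key for registrations.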

Further ideas

OK, we're done! Of course, there's only so much that can be included in a demonstration script. There's plenty of scope for improvement, even if you don't count rewriting it all from scratch. You'll probably want to store the registrations in an SQL database or, alternatively, in the Jabber server's own XDB component. More importantly, a static list of RSS sources is rather restrictive. How about allowing users to register their own URLs? Or building an administrative mode that accepts a special IQ from certain JIDs, with which the RSS source list can be maintained?

The browsing response function would be an ideal candidate for extension. How about allowing a next level of browsing that would return browse items reflecting the specific user's RSS source registrations? And how could we use the fact that the RSS source is included in the component's address to extend the interactive facilities?
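As a starting point for the per-user browsing idea, here is a sketch of the data side only. browse_items_for() is a hypothetical helper, the registrations are made-up sample data in the same shape the script keeps in %reg, and turning the resulting JIDs into actual browse result tags is left open.

```perl
use strict;
use warnings;

my $ID = 'rss.qmacro.dyndns.org';

# Registrations in the same shape the script keeps in %reg:
# bare JID => { source name => 1, ... } (sample data, made up)
my %reg = (
  'dj@qmacro.dyndns.org'     => { slashdot => 1, jonudell => 1 },
  'sabine@qmacro.dyndns.org' => { slashdot => 1 },
);

# Hypothetical helper: one [source]@[componentname] JID for each
# source the given user has registered for, in sorted order
sub browse_items_for {
  my $jid = shift;
  my $registration = $reg{$jid} || {};
  return map { join('@', $_, $ID) } sort keys %$registration;
}

print "$_\n" for browse_items_for('dj@qmacro.dyndns.org');
```

Because each item is itself a JID that names the source, a client could go on to address further requests to it directly.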

The script in its entirety

Here's the script in its entirety.

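use strict;

# Modules the script relies on
use Jabber::Connection;
use Jabber::NodeFactory;
use Jabber::NS qw(:all);   # NS_*, IQ_* and r_HANDLED constants

use MLDBM 'DB_File';       # DB_File as the underlying DBM is an assumption
use LWP::Simple;           # get()
use XML::RSS;

my $NAME     = 'RSS Punter';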
my $ID       = 'rss.qmacro.dyndns.org';
my $VERSION  = '0.1';
my $reg_file = 'registrations';
my %reg;

my %cache;

my %sources = (

  'jonudell' => 'http://udell.roninhouse.com/udell.rdf',
  'slashdot' => 'http://slashdot.org/slashdot.rdf',

  # etc ...

);

tie (%reg, 'MLDBM', $reg_file) or die "Cannot tie to $reg_file: $!\n";

my $c = new Jabber::Connection(
  server    => 'localhost:5999',
  localname => $ID,
  ns        => 'jabber:component:accept',
);

unless ($c->connect()) { die "oops: ".$c->lastError; }

# KILL cannot be trapped, so it is not listed here
$SIG{HUP} = $SIG{TERM} = $SIG{INT} = \&cleanup;

debug("registering RSS beat");
$c->register_beat(1800, \&rss);

debug("registering IQ handlers");
$c->register_handler('iq',\&iq_register);
$c->register_handler('iq',\&iq_version);
$c->register_handler('iq',\&iq_browse);
$c->register_handler('iq',\&iq_notimpl);

$c->auth('secret');

$c->start;


sub iq_register {

  my $node = shift;

  debug("[iq_register]");
  return unless my $query = $node->getTag('', NS_REGISTER);
  debug("--> registration request");

  # Reg query
  if ($node->attr('type') eq IQ_GET) {
    $node = toFrom($node);
    $node->attr('type', IQ_RESULT);
    my $instructions = "Choose an RSS source from: ".join(", ", keys %sources);
    $query->insertTag('instructions')->data($instructions);
    $query->insertTag('text');
    $c->send($node);
  }

  # Reg request
  if ($node->attr('type') eq IQ_SET) {

    # Strip JID to user@host
    my $jid = stripJID($node->attr('from'));

    $node = toFrom($node);
    my $source;

    # Could be an unregister
    if ($query->getTag('remove')) {
      delete $reg{$jid};
      $node->attr('type', IQ_RESULT);
    }

    # Otherwise it's a registration for a source
    elsif ($source = $query->getTag('text')->data 
           and exists($sources{$source})) {
      my $element = $reg{$jid};
      $element->{$source} = 1;
      $reg{$jid} = $element;
      $node->attr('type', IQ_RESULT);
    }
 
    else {
      $node->attr('type', IQ_ERROR);
      my $error = $node->insertTag('error');
      $error->attr('code', '405');
      $error->data('Not Allowed');
    }

    $c->send($node);

  }

  return r_HANDLED;

}


sub iq_version {

  my $node = shift;
  debug("[iq_version]");

  return unless my $query = $node->getTag('', NS_VERSION)
            and $node->attr('type') eq IQ_GET;

  debug("--> version request");

  $node = toFrom($node);
  $node->attr('type', IQ_RESULT);
  $query->insertTag('name')->data($NAME);
  $query->insertTag('version')->data($VERSION);
  chomp(my $os = `uname -sr`);   # drop the trailing newline
  $query->insertTag('os')->data($os);
  $c->send($node);

  return r_HANDLED;

}


sub iq_browse {

  my $node = shift;
  debug("[iq_browse]");

  return unless my $query = $node->getTag('', NS_BROWSE)
            and $node->attr('type') eq IQ_GET;

  debug("--> browse request");

  $node = toFrom($node);
  $node->attr('type', IQ_RESULT);
  my $rss = $query->insertTag('service');
  $rss->attr('type', 'rss');
  $rss->attr('jid', $ID);
  $rss->attr('name', $NAME);
  $rss->insertTag('ns')->data(NS_REGISTER);
  $c->send($node);

  return r_HANDLED;

}


sub iq_notimpl {

  my $node = shift;
  $node = toFrom($node);
  $node->attr('type', IQ_ERROR);
  my $error = $node->insertTag('error');
  $error->attr('code', '501');
  $error->data('Not Implemented');
  $c->send($node);

  return r_HANDLED;

}


sub rss {

  debug("[rss]");

  # Create NodeFactory
  my $nf = new Jabber::NodeFactory;

  # Go through each of the RSS sources
  foreach my $source (keys %sources) {

    # Retrieve attempt
    my $data = get($sources{$source});

    # Didn't get it? Next one
    unless (defined($data)) {
      debug("cannot retrieve $source");
      next;
    }

    # Parse the RSS
    my $rss = XML::RSS->new();
    eval { $rss->parse($data) };

    if ($@) {
      debug("problems parsing $source");
      next;
    }

    my @items = @{$rss->{items}};

    # Check new items
    debug("$source: looking for new items");
    foreach my $item (@items) {

      # Stop checking if we get to items already seen
      last if exists $cache{$source} and $cache{$source} eq $item->{link};

      debug("$source: new item $item->{title}");

      # Create a headline message
      my $msg = $nf->newNode('message');

      $msg->attr('type', 'headline');
      $msg->attr('from', join('@', $source, $ID));
      $msg->insertTag('subject')->data($item->{title});
      $msg->insertTag('body')->data($item->{description});

      my $xoob = $msg->insertTag('x', NS_XOOB);
      $xoob->insertTag('url')->data($item->{link});
      $xoob->insertTag('desc')->data($item->{description});

      # Deliver to all that want it
      foreach my $jid (keys %reg) {

        my $registration = $reg{$jid};

        if (exists($registration->{$source})) {
          $msg->attr('to', $jid);
          debug("punting to $jid");
          $c->send($msg);
        }

      }

      # Prevent all items being counted as new the
      # first time around
      last unless exists($cache{$source});

    }

    # Remember the latest new item
    $cache{$source} = $items[0]->{link};

  }

}


sub cleanup {

  debug("cleaning up");
  untie %reg;
  $c->disconnect;
  exit;

}


sub toFrom {
  my $node = shift;
  my $to = $node->attr('to');
  $node->attr('to', $node->attr('from'));
  $node->attr('from', $to);
  return $node;
}


sub stripJID {

  my $JID = shift;
  $JID =~ s|/.*$||;
  return $JID;

}


sub debug {

  print STDERR "debug: ", @_, "\n";

}

Notes

[1]

RDF stands for Resource Description Framework.

[2]

http://www.jabbercentral.org

[3]

Although a Jabber server could consist of a collection of jabberds running on separate hosts.

[4]

Note that despite the tag name, you can specify an IP address or a hostname in <ip/>.

[5]

Likewise, the namespace jabber:component:exec "matches" the STDIO component connection method and the significant tag name in its component instance definition format (<exec/>); see the section called STDIO in Chapter 4.

[6]

The registration process with the JSM to create a new user account uses jabber:iq:register to qualify the registration data exchanged. The registration process with the JSM to modify the account details (name, email address, and so on) also uses jabber:iq:register to qualify the account amendment data exchanged. Both types of registration request are addressed to the JSM. The key difference, which allows the JSM to distinguish between what is being requested, is that in the new user registration process, no session is active on the stream between client and server, whereas in the account amendment process, a session is active. This is also mentioned in the section called Passwords in Chapter 6.

[7]

It won't be by the time you read this, so don't try it! :-)

[8]

This isn't as bad as it seems. Take store and forward, for example, a feature provided by JSM's mod_offline module. While a message sent to a component won't be stored and forwarded if that component is not connected, a message sent from a component to a client will get stored and forwarded (if the client is offline), because the message will be routed to the JSM (because of the [hostname] in the address), which can decide what action to take: pass it directly to the client if he's online, or store it and forward it later.

[9]

Indeed, the author of the Jabber::Connection library (ahem) has taken the (heart)beat idea, the handler chain idea, and even the low-level NodeFactory mechanisms, directly from the JSM and the server libraries, in homage to the Jabber server's classic design.

[10]

If you wish to have more granular control over your script, you can of course use the process() function directly, just as you would with the Net::Jabber and JabberPy libraries. Be aware, however, that a heartbeat is only maintained in the context of the start() method.

[11]

Jabber::NodeFactory is the "wrapper" around the class that actually represents the elements (the nodes), which is Jabber::NodeFactory::Node. Nodes are created using the Jabber::NodeFactory class.

[12]

Ideally we'd just use one instance of XML::RSS for the whole rss() function, but the way XML::RSS currently works requires us to create a new instance for every source we wish to work with.