<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.jacksleight.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.jacksleight.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title><![CDATA[Jack Sleight's Blog]]></title>
    <link>http://jacksleight.com/</link>
    <description />
    <pubDate>Thu, 24 Jul 2008 00:06:36 +0000</pubDate>
    <generator>Zend_Feed</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.jacksleight.com/JackSleightsBlog" type="application/rss+xml" /><item>
      <title><![CDATA[Firefox 3's Buggy Text Rendering]]></title>
      <link>http://jacksleight.com/blog/2008/06/24/firefox-3s-buggy-text-rendering</link>
      <guid>http://jacksleight.com/blog/2008/06/24/firefox-3s-buggy-text-rendering</guid>
      <description>&lt;p&gt;With the recent release of &lt;a href="http://www.mozilla-europe.org/en/firefox/"&gt;Firefox 3&lt;/a&gt; I&amp;#8217;ve found that the way certain elements are rendered has changed, and in some situations not necessarily for the better. This is obviously due to the new version of the Cairo rendering engine, and no doubt, these changes are all supposed to be improvements (and mostly are). However, I&amp;#8217;ve run into one problem with text rendering in some specific situations. Take a look at this:&lt;/p&gt;

&lt;img src="/assets/blog/firefox-3s-buggy-text-rendering/legibility.png" class="center" alt="Text with optimizeLegibility rendering"&gt;

	&lt;p&gt;That&amp;#8217;s Arial set at 1.35em (from a global font size of 10px) on Windows with ClearType enabled. I know 13.5px isn&amp;#8217;t a round font size, but that&amp;#8217;s not to blame here, really that&amp;#8217;s just 14px. Set it to 1.3em, you get 13px text, set it to 1.4em and it looks the same but with slightly larger leading.&lt;/p&gt;

	&lt;p&gt;The new rendering engine is supposed to optimise the text for legibility; I&amp;#8217;d argue that that is not legible. Thankfully, Firefox 3 now supports the &lt;code&gt;text-rendering&lt;/code&gt; &lt;span class="caps"&gt;CSS&lt;/span&gt; property, which we can use to set the rendering mode to speed optimised, which actually gives much more legible text (in this situation, in my opinion). This gives us something much closer to the Firefox 2 rendering (if not identical).&lt;/p&gt;

&lt;pre class="sh_css"&gt;&lt;code class="sh_css"&gt;text-rendering: optimizeSpeed; /* The default is optimizeLegibility */
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Which gives us:&lt;/p&gt;

&lt;img src="/assets/blog/firefox-3s-buggy-text-rendering/speed.png" class="center" alt="Text with optimizeSpeed rendering"&gt;

	&lt;p&gt;As you can see, much better. I&amp;#8217;m not discounting the improvements made in the new rendering engine, but I do feel that it could still do with some work to iron out these rare but pretty annoying issues. That can&amp;#8217;t be how the text is &lt;em&gt;supposed&lt;/em&gt; to render, right?&lt;/p&gt;</description>
      <pubDate>Tue, 24 Jun 2008 11:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[JS_Extractor 0.1.1]]></title>
      <link>http://jacksleight.com/blog/2008/03/08/js-extractor-0-1-1</link>
      <guid>http://jacksleight.com/blog/2008/03/08/js-extractor-0-1-1</guid>
      <description>&lt;p&gt;Just a quick note to let you know I&amp;#8217;ve just released a new version of JS_Extractor. This release fixes a bug related to extracting attribute values from elements in a hierarchy.&lt;/p&gt;

	&lt;p&gt;&lt;a href="/code"&gt;Download from the Code page&lt;/a&gt;&lt;/p&gt;&lt;img src="http://feeds.jacksleight.com/~r/JackSleightsBlog/~4/250287698" height="1" width="1"/&gt;</description>
      <pubDate>Sat, 08 Mar 2008 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[Ready For Review]]></title>
      <link>http://jacksleight.com/blog/2008/02/23/ready-for-review</link>
      <guid>http://jacksleight.com/blog/2008/02/23/ready-for-review</guid>
      <description>&lt;p&gt;I&amp;#8217;m pleased to announce that the two &lt;a href="http://framework.zend.com/"&gt;Zend Framework&lt;/a&gt; proposals I&amp;#8217;ve been working on, &lt;a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Color+-+Jack+Sleight"&gt;Zend_Color&lt;/a&gt; and &lt;a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Db_Table_Plugin+-+Simon+Mundy%2C+Jack+Sleight"&gt;Zend_Db_Table_Plugin&lt;/a&gt; (along with Simon Mundy) are now complete and ready for community and team review. Even if you don&amp;#8217;t use the Zend Framework, the Zend_Color component can be used very easily standalone, and is intended to be a replacement for my previous Colour Tools class, so check it out.&lt;/p&gt;

	&lt;p&gt;Full details can be found on the individual proposal pages, and preview code downloads for Zend_Color can be found on the &lt;a href="/code"&gt;code&lt;/a&gt; page.&lt;/p&gt;</description>
      <pubDate>Sat, 23 Feb 2008 17:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[JS_Extractor! And the death of Table Extractor]]></title>
      <link>http://jacksleight.com/blog/2008/02/10/js-extractor-and-the-death-of-table-extractor</link>
      <guid>http://jacksleight.com/blog/2008/02/10/js-extractor-and-the-death-of-table-extractor</guid>
      <description>&lt;p&gt;So, it&amp;#8217;s been a long time since I wrote (or even looked at) Table Extractor, and almost as soon as I wrote it I knew there were a lot of problems. For a start:&lt;/p&gt;

	&lt;ul&gt;
		&lt;li&gt;It only worked with tables&lt;/li&gt;
		&lt;li&gt;It didn&amp;#8217;t really do that properly, or at least reliably&lt;/li&gt;
		&lt;li&gt;It was a horrible mess of hacky code designed to work around hacky &lt;span class="caps"&gt;HTML&lt;/span&gt;&lt;/li&gt;
		&lt;li&gt;It was written for &lt;span class="caps"&gt;PHP&lt;/span&gt; &lt;em&gt;4&lt;/em&gt;, pah! Seriously, no one can still be using that, can they?&lt;/li&gt;
	&lt;/ul&gt;

	&lt;p&gt;Despite all of these problems, it was surprisingly popular, and I still regularly get emails asking how to use it, suggesting new features or reporting bugs, which although I appreciate the time people have taken, I just couldn&amp;#8217;t do anything about.&lt;/p&gt;

	&lt;p&gt;Anyway, that&amp;#8217;s in the past, and today I&amp;#8217;m releasing the first (beta) version of JS_Extractor, a brand new, completely reworked in every conceivable way, class library, designed for extracting data from &lt;span class="caps"&gt;HTML&lt;/span&gt; documents. And when I say data, I mean &lt;em&gt;any&lt;/em&gt; data, not just tables.&lt;/p&gt;

	&lt;p&gt;Before I get into the examples I want to explain the new approach I&amp;#8217;ve taken, and the various aspects of the new extractor. If you don&amp;#8217;t care about that and just want to get your hands dirty then head down to &lt;a href="#examples"&gt;the examples&lt;/a&gt;, but don&amp;#8217;t complain to me if you don&amp;#8217;t get it.&lt;/p&gt;

	&lt;h2&gt;&lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; Extension and XPath&lt;/h2&gt;

	&lt;p&gt;JS_Extractor is actually an extension of the &lt;span class="caps"&gt;PHP&lt;/span&gt; &lt;a href="http://uk2.php.net/dom"&gt;&lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; extension&lt;/a&gt;, and you can therefore use all of the &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; methods with any &lt;code&gt;JS_Extractor&lt;/code&gt; or &lt;code&gt;JS_Extractor_Element&lt;/code&gt; object. If you don&amp;#8217;t know or have never used the &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; extension I seriously suggest you take a quick look over the &lt;a href="http://uk2.php.net/dom"&gt;documentation&lt;/a&gt;, just so you&amp;#8217;re aware of what&amp;#8217;s possible.&lt;/p&gt;

	&lt;p&gt;The second important aspect of JS_Extractor is that it uses XPath, a lot, and adds one very useful method to the vanilla &lt;span class="caps"&gt;DOM&lt;/span&gt; classes, &lt;code&gt;query()&lt;/code&gt;. This method is really nothing more than a wrapper for a new &lt;code&gt;DOMXPath&lt;/code&gt; object, but makes it much more convenient to run XPath queries on an element. For example, rather than this:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$xpath = new DOMXPath($doc);
$nodes = $xpath-&amp;#62;query($expression, $element);
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;You can do this:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$nodes = $element-&amp;#62;query($expression);
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Which also allows you to easily chain queries together. (Yes, I stole this idea from SimpleXML.)&lt;/p&gt;
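
	&lt;p&gt;If you&amp;#8217;re wondering how a wrapper like that might be put together, here&amp;#8217;s a rough sketch using nothing but the stock &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; classes. To be clear, this is not the JS_Extractor source; the &lt;code&gt;Sketch_Element&lt;/code&gt; class and the sample markup are made up for illustration, and it assumes &lt;span class="caps"&gt;PHP&lt;/span&gt; 5.2 or later for &lt;code&gt;registerNodeClass()&lt;/code&gt;:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;// Hypothetical subclass, not part of JS_Extractor
class Sketch_Element extends DOMElement
{
	// Run an XPath expression relative to this element
	public function query($expression)
	{
		$xpath = new DOMXPath($this-&amp;#62;ownerDocument);
		return $xpath-&amp;#62;query($expression, $this);
	}
}

$doc = new DOMDocument();
// Ask the document to hand back Sketch_Element objects instead of DOMElement
$doc-&amp;#62;registerNodeClass(&amp;#39;DOMElement&amp;#39;, &amp;#39;Sketch_Element&amp;#39;);
$doc-&amp;#62;loadHTML(&amp;#39;&amp;#60;ul&amp;#62;&amp;#60;li&amp;#62;One&amp;#60;/li&amp;#62;&amp;#60;li&amp;#62;Two&amp;#60;/li&amp;#62;&amp;#60;/ul&amp;#62;&amp;#39;);

// Chained queries: find the list, then its items
$items = $doc-&amp;#62;documentElement-&amp;#62;query(&amp;#39;.//ul&amp;#39;)-&amp;#62;item(0)-&amp;#62;query(&amp;#39;li&amp;#39;);
foreach ($items as $item) {
	echo $item-&amp;#62;textContent, &amp;#34;\n&amp;#34;;
}
&lt;/code&gt;&lt;/pre&gt;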

	&lt;p&gt;I&amp;#8217;m not going to cover the details of XPath here, so if you&amp;#8217;re not familiar with it already take a look at the &lt;a href="http://www.w3schools.com/xpath/xpath_syntax.asp"&gt;syntax guide from W3Schools&lt;/a&gt; (good for beginners) or the full &lt;a href="http://www.w3.org/TR/xpath"&gt;XPath spec&lt;/a&gt;. You should find the examples below fairly self-explanatory though.&lt;/p&gt;

	&lt;p&gt;The &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; extensions ability to parse (even dirty) &lt;span class="caps"&gt;HTML&lt;/span&gt;, XPath support and new &lt;code&gt;query()&lt;/code&gt; method are the heart of JS_Extractor, and give you a lot of power even without the specific &amp;#8220;extractor&amp;#8221; methods. For example, you could get every link on the page and then echo the &lt;code&gt;href&lt;/code&gt; attribute values as simply as:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$extractor = new JS_Extractor(file_get_contents(&amp;#39;sample.html&amp;#39;));
foreach ($extractor-&amp;#62;query(&amp;#34;//a&amp;#34;) as $link) {
	echo $link-&amp;#62;getAttribute(&amp;#39;href&amp;#39;);
}
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Easy! Of course, the additional extractor methods make this even easier, and make other, more complicated problems easy as well.&lt;/p&gt;

	&lt;h2&gt;Utility Methods&lt;/h2&gt;

	&lt;p&gt;There are also methods available for &amp;#8220;tidying up&amp;#8221; the data before you extract it. There&amp;#8217;s actually only one of these right now, and that&amp;#8217;s &lt;code&gt;splitCells()&lt;/code&gt;, which applies to &lt;code&gt;table&lt;/code&gt;, &lt;code&gt;thead&lt;/code&gt;, &lt;code&gt;tfoot&lt;/code&gt;, and &lt;code&gt;tbody&lt;/code&gt; elements. This method will scan through the table cells and split any with a &lt;code&gt;colspan&lt;/code&gt; or &lt;code&gt;rowspan&lt;/code&gt; attribute, duplicating the content for each. This is essential for retrieving tabular data in a simple two-dimensional structure.&lt;/p&gt;
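
	&lt;p&gt;To give a feel for what splitting a cell involves, here&amp;#8217;s a simplified sketch with the plain &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; classes. It only deals with &lt;code&gt;colspan&lt;/code&gt; (a &lt;code&gt;rowspan&lt;/code&gt; needs copies placed in the following rows, which is fiddlier), and the &lt;code&gt;sketchSplitColspans()&lt;/code&gt; function and sample table are invented for the example rather than taken from the library:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;// Hypothetical helper: handles colspan only, not rowspan
function sketchSplitColspans(DOMElement $table)
{
	$xpath = new DOMXPath($table-&amp;#62;ownerDocument);
	foreach ($xpath-&amp;#62;query(&amp;#39;.//td[@colspan] | .//th[@colspan]&amp;#39;, $table) as $cell) {
		$span = (int) $cell-&amp;#62;getAttribute(&amp;#39;colspan&amp;#39;);
		// Remove the attribute first, so the copies don&amp;#39;t carry it
		$cell-&amp;#62;removeAttribute(&amp;#39;colspan&amp;#39;);
		// Add one copy of the cell for each extra column it covered
		for ($i = $span; $i &amp;#62; 1; $i--) {
			$cell-&amp;#62;parentNode-&amp;#62;insertBefore($cell-&amp;#62;cloneNode(true), $cell-&amp;#62;nextSibling);
		}
	}
}

$doc = new DOMDocument();
$doc-&amp;#62;loadHTML(&amp;#39;&amp;#60;table&amp;#62;&amp;#60;tr&amp;#62;&amp;#60;td colspan=&amp;#34;3&amp;#34;&amp;#62;A&amp;#60;/td&amp;#62;&amp;#60;td&amp;#62;B&amp;#60;/td&amp;#62;&amp;#60;/tr&amp;#62;&amp;#60;/table&amp;#62;&amp;#39;);
sketchSplitColspans($doc-&amp;#62;getElementsByTagName(&amp;#39;table&amp;#39;)-&amp;#62;item(0));
echo $doc-&amp;#62;getElementsByTagName(&amp;#39;td&amp;#39;)-&amp;#62;length; // 4 cells: A, A, A, B
&lt;/code&gt;&lt;/pre&gt;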

	&lt;h2&gt;&lt;code&gt;extract()&lt;/code&gt;&lt;/h2&gt;

	&lt;p&gt;This does all the magic. Actually, it&amp;#8217;s really nothing more than a convenience, you can do everything this does with the standard &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; methods, but why make things more complicated than they need to be? This is primarily aimed at extracting the text within elements, in a hierarchical structure you define, or a specific attribute from a number of elements. This covers the most common uses, and anything more complicated can be achieved with the &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; and &lt;code&gt;query()&lt;/code&gt; methods.&lt;/p&gt;

	&lt;h2 id="examples"&gt;Examples&lt;/h2&gt;

	&lt;p&gt;All of these examples are run on &lt;a href="/assets/blog/js-extractor-and-the-death-of-table-extractor/sample.html"&gt;this sample data&lt;/a&gt;.&lt;/p&gt;

	&lt;p&gt;Before you start you&amp;#8217;ll need to add the library path to your &lt;span class="caps"&gt;PHP&lt;/span&gt; &lt;code&gt;include_path&lt;/code&gt; and then include the Extractor.php class file, by doing something like this:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;set_include_path(get_include_path() . PATH_SEPARATOR . &amp;#39;./library/&amp;#39;);
require_once &amp;#39;JS/Extractor.php&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Right, first we need to create the extractor object. The constructor requires a string of &lt;span class="caps"&gt;HTML&lt;/span&gt;, so use &lt;code&gt;file_get_contents()&lt;/code&gt;, or another function to retrieve the contents of a local or remote file if you need to:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$extractor = new JS_Extractor($html); // or
$extractor = new JS_Extractor(file_get_contents(&amp;#39;sample.html&amp;#39;)); // or
$extractor = new JS_Extractor(file_get_contents(&amp;#39;http://example.com/&amp;#39;));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Next I&amp;#8217;m going to retrieve the body element of the document. This is necessary because of the way the extension of the &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; classes works: the utility and &lt;code&gt;extract()&lt;/code&gt; methods are not available on the &lt;code&gt;JS_Extractor&lt;/code&gt; object, only on &lt;code&gt;JS_Extractor_Element&lt;/code&gt; objects.&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$body = $extractor-&amp;#62;query(&amp;#34;body&amp;#34;)-&amp;#62;item(0);
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;As you can see, the &lt;code&gt;query()&lt;/code&gt; method returns a &lt;code&gt;DOMNodeList&lt;/code&gt; rather than a single element, therefore if you only need one element you have to call the &lt;code&gt;item()&lt;/code&gt; method.&lt;/p&gt;
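
	&lt;p&gt;One thing to watch: if nothing matches, &lt;code&gt;item(0)&lt;/code&gt; gives you &lt;code&gt;null&lt;/code&gt; rather than an element, so it&amp;#8217;s worth checking before you call anything on the result. Here&amp;#8217;s a small sketch using the plain &lt;code&gt;DOMXPath&lt;/code&gt; class (the markup is made up for the example):&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$doc = new DOMDocument();
$doc-&amp;#62;loadHTML(&amp;#39;&amp;#60;p&amp;#62;No tables here&amp;#60;/p&amp;#62;&amp;#39;);
$xpath = new DOMXPath($doc);

// item() returns null when the index doesn&amp;#39;t exist in the list
$table = $xpath-&amp;#62;query(&amp;#39;//table&amp;#39;)-&amp;#62;item(0);
if ($table === null) {
	echo &amp;#39;No table found&amp;#39;;
} else {
	echo $table-&amp;#62;nodeName;
}
&lt;/code&gt;&lt;/pre&gt;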

	&lt;h3&gt;Tables&lt;/h3&gt;

	&lt;p&gt;Now we grab the first table in the body, like this:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$table = $body-&amp;#62;query(&amp;#34;//table&amp;#34;)-&amp;#62;item(0);
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;As far as selecting the right table goes, you&amp;#8217;re only limited to what you can do with XPath, which is pretty much anything. I&amp;#8217;ll cover some more examples on selecting elements later on.&lt;/p&gt;

	&lt;p&gt;Now before we start extracting data from this table we need to clean it up, because this table has &lt;code&gt;colspan&lt;/code&gt;s and &lt;code&gt;rowspan&lt;/code&gt;s, which need splitting and duplicating in order to create a simple two-dimensional structure. This is as easy as:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$table-&amp;#62;splitCells();
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Right, let&amp;#8217;s extract all the cell data from the rows in the &lt;code&gt;tbody&lt;/code&gt;, grouped by row:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $table-&amp;#62;extract(array(&amp;#34;tbody/tr&amp;#34;, &amp;#34;td&amp;#34;));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;As you can see, the first argument of the &lt;code&gt;extract()&lt;/code&gt; method is an array of XPath expressions. These define the hierarchical structure of the array that is returned. What you&amp;#8217;re saying here is: get all the &lt;code&gt;tr&lt;/code&gt; elements from the &lt;code&gt;tbody&lt;/code&gt; element, and then get the text from all the &lt;code&gt;td&lt;/code&gt; elements within those. This will return something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;array
  0 =&amp;#62; 
    array
      0 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      1 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      2 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      3 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
  1 =&amp;#62; 
    array
      0 =&amp;#62; string &amp;#39;B&amp;#39; (length=1)
      1 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      2 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      3 =&amp;#62; string &amp;#39;B&amp;#39; (length=1)
  2 =&amp;#62; 
    array
      0 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
      1 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
      2 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
      3 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
&lt;/code&gt;&lt;/pre&gt;
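
	&lt;p&gt;The idea of one XPath expression per level can be sketched as a small recursive function over the plain &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; classes. This is my own simplified illustration, not the actual &lt;code&gt;extract()&lt;/code&gt; implementation; &lt;code&gt;sketchExtract()&lt;/code&gt; and the two-row table are hypothetical, and it ignores named groups and attributes entirely:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;// Hypothetical function, not the library&amp;#39;s implementation
function sketchExtract(DOMXPath $xpath, DOMNode $context, array $expressions)
{
	// Take the expression for this level; the rest are for deeper levels
	$expression = array_shift($expressions);
	$result = array();
	foreach ($xpath-&amp;#62;query($expression, $context) as $node) {
		// More levels left: descend into this node. Otherwise take its text.
		$result[] = $expressions
			? sketchExtract($xpath, $node, $expressions)
			: trim($node-&amp;#62;textContent);
	}
	return $result;
}

$doc = new DOMDocument();
$doc-&amp;#62;loadHTML(&amp;#39;&amp;#60;table&amp;#62;&amp;#60;tbody&amp;#62;&amp;#60;tr&amp;#62;&amp;#60;td&amp;#62;A&amp;#60;/td&amp;#62;&amp;#60;td&amp;#62;B&amp;#60;/td&amp;#62;&amp;#60;/tr&amp;#62;&amp;#60;tr&amp;#62;&amp;#60;td&amp;#62;C&amp;#60;/td&amp;#62;&amp;#60;td&amp;#62;D&amp;#60;/td&amp;#62;&amp;#60;/tr&amp;#62;&amp;#60;/tbody&amp;#62;&amp;#60;/table&amp;#62;&amp;#39;);
$xpath = new DOMXPath($doc);
$table = $doc-&amp;#62;getElementsByTagName(&amp;#39;table&amp;#39;)-&amp;#62;item(0);

$data = sketchExtract($xpath, $table, array(&amp;#39;tbody/tr&amp;#39;, &amp;#39;td&amp;#39;));
// $data is array(array(&amp;#39;A&amp;#39;, &amp;#39;B&amp;#39;), array(&amp;#39;C&amp;#39;, &amp;#39;D&amp;#39;))
&lt;/code&gt;&lt;/pre&gt;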

	&lt;p&gt;Great! Now how about all &lt;code&gt;tr&lt;/code&gt;s rather than just the ones in the &lt;code&gt;tbody&lt;/code&gt;? You &lt;em&gt;could&lt;/em&gt; do:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $table-&amp;#62;extract(array(&amp;#34;.//tr&amp;#34;, &amp;#34;td&amp;#34;));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;But you&amp;#8217;ll run into a problem, because the &lt;code&gt;thead&lt;/code&gt; contains &lt;code&gt;th&lt;/code&gt; elements rather than &lt;code&gt;td&lt;/code&gt;s, so instead we need to do:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $table-&amp;#62;extract(array(&amp;#34;.//tr&amp;#34;, &amp;#34;th|td&amp;#34;));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;But now we have another problem. We have all the rows, but no idea which ones came from the &lt;code&gt;thead&lt;/code&gt;, &lt;code&gt;tbody&lt;/code&gt; and &lt;code&gt;tfoot&lt;/code&gt;, which is kinda important. The first thing we need to do is separate the &lt;code&gt;tr&lt;/code&gt; expression into individual parts for the &lt;code&gt;thead&lt;/code&gt;, &lt;code&gt;tbody&lt;/code&gt; and &lt;code&gt;tfoot&lt;/code&gt;, like so:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $table-&amp;#62;extract(array(
	array(&amp;#34;thead/tr&amp;#34;, &amp;#34;tbody/tr&amp;#34;, &amp;#34;tfoot/tr&amp;#34;),
	&amp;#34;th|td&amp;#34;,
));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Then we need to name these parts so that when the array comes back the rows are grouped into the right section:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $table-&amp;#62;extract(array(
	array(&amp;#39;head&amp;#39; =&amp;#62; &amp;#34;thead/tr&amp;#34;, &amp;#39;foot&amp;#39; =&amp;#62; &amp;#34;tfoot/tr&amp;#34;, &amp;#39;body&amp;#39; =&amp;#62; &amp;#34;tbody/tr&amp;#34;),
	&amp;#34;th|td&amp;#34;,
));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;This will give you something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;array
  &amp;#39;head&amp;#39; =&amp;#62; 
    array
      0 =&amp;#62; 
        array
          0 =&amp;#62; string &amp;#39;H&amp;#39; (length=1)
          1 =&amp;#62; string &amp;#39;H&amp;#39; (length=1)
          2 =&amp;#62; string &amp;#39;H&amp;#39; (length=1)
          3 =&amp;#62; string &amp;#39;H&amp;#39; (length=1)
  &amp;#39;foot&amp;#39; =&amp;#62; 
    array
      0 =&amp;#62; 
        array
          0 =&amp;#62; string &amp;#39;F&amp;#39; (length=1)
          1 =&amp;#62; string &amp;#39;F&amp;#39; (length=1)
          2 =&amp;#62; string &amp;#39;F&amp;#39; (length=1)
          3 =&amp;#62; string &amp;#39;F&amp;#39; (length=1)
  &amp;#39;body&amp;#39; =&amp;#62; 
    array
      0 =&amp;#62; 
        array
          0 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
          1 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
          2 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
          3 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
      1 =&amp;#62; 
        array
          0 =&amp;#62; string &amp;#39;B&amp;#39; (length=1)
          1 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
          2 =&amp;#62; string &amp;#39;A&amp;#39; (length=1)
          3 =&amp;#62; string &amp;#39;B&amp;#39; (length=1)
      2 =&amp;#62; 
        array
          0 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
          1 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
          2 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
          3 =&amp;#62; string &amp;#39;C&amp;#39; (length=1)
&lt;/code&gt;&lt;/pre&gt;
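
	&lt;p&gt;Once you have an array in that shape, working through it is just a pair of nested loops. The &lt;code&gt;$data&lt;/code&gt; array here is a cut-down, hand-written copy of the structure above, so the snippet runs on its own:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;// A cut-down copy of the structure shown above
$data = array(
	&amp;#39;head&amp;#39; =&amp;#62; array(array(&amp;#39;H&amp;#39;, &amp;#39;H&amp;#39;)),
	&amp;#39;foot&amp;#39; =&amp;#62; array(array(&amp;#39;F&amp;#39;, &amp;#39;F&amp;#39;)),
	&amp;#39;body&amp;#39; =&amp;#62; array(array(&amp;#39;A&amp;#39;, &amp;#39;A&amp;#39;), array(&amp;#39;B&amp;#39;, &amp;#39;A&amp;#39;)),
);

$lines = array();
foreach ($data as $section =&amp;#62; $rows) {
	foreach ($rows as $index =&amp;#62; $cells) {
		// e.g. &amp;#34;body 1: B, A&amp;#34;
		$lines[] = $section . &amp;#39; &amp;#39; . $index . &amp;#39;: &amp;#39; . implode(&amp;#39;, &amp;#39;, $cells);
	}
}
echo implode(&amp;#34;\n&amp;#34;, $lines);
&lt;/code&gt;&lt;/pre&gt;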

	&lt;p&gt;And that&amp;#8217;s it! Of course this is in no way limited to tables, and below are some further examples of extracting data from other elements:&lt;/p&gt;

	&lt;h3&gt;Lists (&lt;code&gt;ul&lt;/code&gt;, &lt;code&gt;ol&lt;/code&gt;)&lt;/h3&gt;

	&lt;p&gt;Here we get the &lt;code&gt;ul&lt;/code&gt; element with the id &amp;#8220;list&amp;#8221;, and then extract the text from each &lt;code&gt;li&lt;/code&gt;:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$list = $body-&amp;#62;query(&amp;#34;//ul[@id=&amp;#39;list&amp;#39;]&amp;#34;)-&amp;#62;item(0);
$data = $list-&amp;#62;extract(&amp;#34;li&amp;#34;);
&lt;/code&gt;&lt;/pre&gt;

	&lt;h3&gt;Custom Markup&lt;/h3&gt;

	&lt;p&gt;You can even extract data from custom markup based on &lt;code&gt;div&lt;/code&gt;s and &lt;code&gt;span&lt;/code&gt;s (or any other element type):&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$data = $body-&amp;#62;extract(array(
	&amp;#34;div[@class=&amp;#39;article&amp;#39;]&amp;#34;,
	array(&amp;#39;title&amp;#39; =&amp;#62; &amp;#34;h2&amp;#34;, &amp;#39;date&amp;#39; =&amp;#62; &amp;#34;span[@class=&amp;#39;date&amp;#39;]&amp;#34;, &amp;#39;body&amp;#39; =&amp;#62; &amp;#34;p&amp;#34;),
));
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;What you&amp;#8217;re doing here is getting all the &lt;code&gt;div&lt;/code&gt; elements with a class of &amp;#8220;article&amp;#8221;, and then extracting the title, date and body text from the relevant elements.&lt;/p&gt;

	&lt;h3&gt;Attribute Data&lt;/h3&gt;

	&lt;p&gt;The attribute extraction method returns the value of a specified attribute, rather than the element&amp;#8217;s text content. Although not demonstrated here, the hierarchical structure feature works in exactly the same way with the attribute extraction. Here we&amp;#8217;re going to get all the &lt;code&gt;href&lt;/code&gt; values of all &lt;code&gt;a&lt;/code&gt; elements in the body:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$urls = $body-&amp;#62;extract(&amp;#34;.//a&amp;#34;, JS_Extractor::EXTRACT_ATTRIBUTE, &amp;#39;href&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Or alternatively, all &lt;code&gt;href&lt;/code&gt; values of any element in the body that has an &lt;code&gt;href&lt;/code&gt; attribute:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$urls = $body-&amp;#62;extract(&amp;#34;.//*[@href]&amp;#34;, JS_Extractor::EXTRACT_ATTRIBUTE, &amp;#39;href&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;
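
	&lt;p&gt;For comparison, this is roughly what the same job looks like with the plain &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; classes and no extractor at all. It&amp;#8217;s only a sketch, and the markup is made up for the example:&lt;/p&gt;

&lt;pre class="sh_php"&gt;&lt;code class="sh_php"&gt;$doc = new DOMDocument();
$doc-&amp;#62;loadHTML(&amp;#39;&amp;#60;p&amp;#62;&amp;#60;a href=&amp;#34;/one&amp;#34;&amp;#62;One&amp;#60;/a&amp;#62; &amp;#60;a&amp;#62;None&amp;#60;/a&amp;#62; &amp;#60;a href=&amp;#34;/two&amp;#34;&amp;#62;Two&amp;#60;/a&amp;#62;&amp;#60;/p&amp;#62;&amp;#39;);
$xpath = new DOMXPath($doc);

// Collect the href of every element that actually has one
$urls = array();
foreach ($xpath-&amp;#62;query(&amp;#39;//*[@href]&amp;#39;) as $element) {
	$urls[] = $element-&amp;#62;getAttribute(&amp;#39;href&amp;#39;);
}
// $urls is array(&amp;#39;/one&amp;#39;, &amp;#39;/two&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;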

	&lt;p&gt;That covers everything I wanted to demonstrate today. I plan to post further, more specific examples in the future, so if you have any requests let me know.&lt;/p&gt;

	&lt;h2&gt;Download&lt;/h2&gt;

	&lt;p&gt;Please bear in mind that this is a beta version, and therefore the &lt;abbr title="Application Programming Interface"&gt;API&lt;/abbr&gt; may change in future releases.&lt;/p&gt;

	&lt;p&gt;&lt;a href="/code"&gt;Download from the Code page&lt;/a&gt;&lt;/p&gt;

	&lt;p&gt;I will also be posting the full &lt;abbr title="Application Programming Interface"&gt;API&lt;/abbr&gt; documentation in the near future.&lt;/p&gt;

	&lt;h2&gt;What about Table Extractor?&lt;/h2&gt;

	&lt;p&gt;Well, this post marks the death of Table Extractor, anyone using it should start using JS_Extractor. If you&amp;#8217;re on &lt;span class="caps"&gt;PHP&lt;/span&gt; 4 then upgrade, seriously, there&amp;#8217;s no good reason not to, and many reasons to do so. If you don&amp;#8217;t have the &lt;abbr title="Document Object Model"&gt;DOM&lt;/abbr&gt; extension enabled then enable it, I mean come on, it comes with &lt;span class="caps"&gt;PHP&lt;/span&gt; and is enabled by default anyway.&lt;/p&gt;</description>
      <pubDate>Sun, 10 Feb 2008 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[My HTML 5 Favourites]]></title>
      <link>http://jacksleight.com/blog/2008/01/24/my-html-5-favourites</link>
      <guid>http://jacksleight.com/blog/2008/01/24/my-html-5-favourites</guid>
      <description>&lt;p&gt;With the W3C publishing the latest &lt;a href="http://www.w3.org/TR/html5/"&gt;&lt;span class="caps"&gt;HTML&lt;/span&gt; 5 working draft&lt;/a&gt; and supporting &lt;a href="http://www.w3.org/TR/html5-diff/"&gt;differences from &lt;span class="caps"&gt;HTML&lt;/span&gt; 4&lt;/a&gt; document, I thought I&amp;#8217;d list some of my favourite additions and changes:&lt;/p&gt;

	&lt;h2&gt;New Elements&lt;/h2&gt;

	&lt;ul&gt;
		&lt;li&gt;&lt;code&gt;figure&lt;/code&gt; to associate a caption with some content, such as an image.&lt;/li&gt;
		&lt;li&gt;&lt;code&gt;audio&lt;/code&gt; and &lt;code&gt;video&lt;/code&gt; can be used for multimedia content.&lt;/li&gt;
		&lt;li&gt;&lt;code&gt;canvas&lt;/code&gt; can be used for rendering dynamic graphics such as graphs.&lt;/li&gt;
		&lt;li&gt;&lt;code&gt;datagrid&lt;/code&gt; for interactive tabular data.&lt;/li&gt;
		&lt;li&gt;The &lt;code&gt;input&lt;/code&gt; element now has some new types: &lt;code&gt;datetime&lt;/code&gt;, &lt;code&gt;datetime-local&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;week&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;range&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt; and &lt;code&gt;url&lt;/code&gt;.&lt;/li&gt;
	&lt;/ul&gt;

	&lt;h2&gt;New Attributes&lt;/h2&gt;

	&lt;ul&gt;
		&lt;li&gt;The &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;area&lt;/code&gt; elements have a new attribute called &lt;code&gt;ping&lt;/code&gt; that specifies a list of URIs which have to be pinged when the hyperlink is followed.&lt;/li&gt;
		&lt;li&gt;The new &lt;code&gt;autofocus&lt;/code&gt; attribute can be specified to focus a form control during page load.&lt;/li&gt;
		&lt;li&gt;The new &lt;code&gt;form&lt;/code&gt; attribute allows for controls to be associated with multiple forms.&lt;/li&gt;
		&lt;li&gt;The new &lt;code&gt;required&lt;/code&gt; attribute on form controls indicates that the user has to fill in a value in order to submit the form.&lt;/li&gt;
		&lt;li&gt;The &lt;code&gt;input&lt;/code&gt; element has several new attributes to specify constraints: &lt;code&gt;autocomplete&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, &lt;code&gt;pattern&lt;/code&gt; and &lt;code&gt;step&lt;/code&gt;.&lt;/li&gt;
	&lt;/ul&gt;

	&lt;p&gt;These are just the specific bits I&amp;#8217;m most interested in; take a look at the &lt;a href="http://www.w3.org/TR/html5-diff/"&gt;&lt;span class="caps"&gt;HTML&lt;/span&gt; 5 differences from &lt;span class="caps"&gt;HTML&lt;/span&gt; 4&lt;/a&gt; document for the full list of new features and changes, in more detail.&lt;/p&gt;

	&lt;p&gt;Of course, none of this matters for now, as it&amp;#8217;s going to be a &lt;em&gt;long&lt;/em&gt; time before we get to use any &lt;span class="caps"&gt;HTML&lt;/span&gt; 5 goodness in real applications and sites, but it&amp;#8217;s good to see things moving forward.&lt;/p&gt;</description>
      <pubDate>Thu, 24 Jan 2008 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[X-UA-Compatible in the Future?]]></title>
      <link>http://jacksleight.com/blog/2008/01/24/x-ua-compatible-in-the-future</link>
      <guid>http://jacksleight.com/blog/2008/01/24/x-ua-compatible-in-the-future</guid>
      <description>&lt;p&gt;After &lt;a href="http://blogs.msdn.com/ie/archive/2008/01/21/compatibility-and-ie8.aspx"&gt;all&lt;/a&gt; &lt;a href="http://alistapart.com/articles/beyonddoctype"&gt;the&lt;/a&gt; &lt;a href="http://alistapart.com/articles/fromswitchestotargets"&gt;talk&lt;/a&gt; &lt;a href="http://meyerweb.com/eric/thoughts/2008/01/23/version-two/"&gt;about&lt;/a&gt; the IE8 &amp;#8220;super standards mode&amp;#8221; (something that should really be called &amp;#8220;the standard, normal, default mode&amp;#8221;) opt-in, I&amp;#8217;ve put down &lt;a href="http://meyerweb.com/eric/thoughts/2008/01/23/version-two/?#comment-305000"&gt;my&lt;/a&gt; &lt;a href="http://meyerweb.com/eric/thoughts/2008/01/23/version-two/?#comment-305061"&gt;thoughts&lt;/a&gt; in the comments on Eric Meyer&amp;#8217;s latest blog post. The gist being that if we have to have it at all, it should be a one-off, short-term solution specifically for the jump from IE7 to IE8.&lt;/p&gt;</description>
      <pubDate>Thu, 24 Jan 2008 11:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[IE Floated Columns Totaling 100% Bug]]></title>
      <link>http://jacksleight.com/blog/2008/01/20/ie-floated-columns-totaling-100-percent-bug</link>
      <guid>http://jacksleight.com/blog/2008/01/20/ie-floated-columns-totaling-100-percent-bug</guid>
      <description>&lt;p&gt;I feel I&amp;#8217;ve been blogging a lot about IE bugs so far, which I guess I have; but then, I wouldn&amp;#8217;t have to if there weren&amp;#8217;t so many! Anyway, this particular bug has no doubt been worked around in many of the column based layout templates available on the web, it may have even been worked around in this exact way, but I didn&amp;#8217;t find it. And when specifically Googling for it nothing of any use came up, so hopefully this should help some people out.&lt;/p&gt;

	&lt;p&gt;If you&amp;#8217;ve ever tried to create a liquid column layout by floating a set of divs next to each other, with a total width of 100%, then you&amp;#8217;ve no doubt run into this problem, where IE calculates the total width as slightly more than 100%, and ends up wrapping the last column below the others. &lt;del&gt;I assume it&amp;#8217;s some kind of rounding error.&lt;/del&gt; &lt;ins&gt;It&amp;#8217;s due to the way IE rounds sub-pixel widths.&lt;/ins&gt;&lt;/p&gt;

	&lt;p&gt;&lt;a href="/assets/blog/ie-floated-columns-totaling-100-percent-bug/problem.html"&gt;See Problem&lt;/a&gt; (if it looks right, try adjusting the browser window width)&lt;/p&gt;

	&lt;p&gt;It&amp;#8217;s a pain, but I&amp;#8217;ve found a way to fix it using negative margins (again); I&amp;#8217;m starting to think negative margins are IE&amp;#8217;s best friend. The fix is simple, just apply a negative right margin to the last column, like so:&lt;/p&gt;

&lt;pre class="sh_css"&gt;&lt;code class="sh_css"&gt;#right	{ margin-right: -2px; }
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;&lt;ins&gt;The actual number of pixels you should set it to varies depending on the number of columns you have, but I find a reliable rule to go by is equal to or greater than half the total number of columns. So if you have three columns, set it to two, if you have ten, set it to five, and so on.&lt;/ins&gt;&lt;/p&gt;

	&lt;p&gt;This accommodates the additional width and stops the last column wrapping. It also has absolutely no effect visually (the last column remains the same width, and stays in the same position). &lt;del&gt;I tried this with 1 pixel but that wasn&amp;#8217;t quite enough to fix it completely. I&amp;#8217;ve had no wrapping whatsoever with 2 pixels.&lt;/del&gt;&lt;/p&gt;

	&lt;p&gt;&lt;a href="/assets/blog/ie-floated-columns-totaling-100-percent-bug/solution.html"&gt;See Solution&lt;/a&gt;&lt;/p&gt;

	&lt;p&gt;This problem exists in both IE6 and IE7, and this solution works in both.&lt;/p&gt;

	&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; John Resig has &lt;a href="http://ejohn.org/blog/sub-pixel-problems-in-css/"&gt;posted an excellent article&lt;/a&gt; describing what causes this and the problems browsers have in calculating the correct widths. I&amp;#8217;ve also learnt that this can&amp;#8217;t technically be considered an IE &amp;#8220;bug&amp;#8221;, as the behaviour in this situation isn&amp;#8217;t actually standardized. Of course, that doesn&amp;#8217;t mean IE&amp;#8217;s behaviour is right.&lt;/p&gt;</description>
      <pubDate>Sun, 20 Jan 2008 15:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[IE7 Dotted Borders: Close, But Not Close Enough]]></title>
      <link>http://jacksleight.com/blog/2008/01/16/ie7-dotted-borders-close-but-not-close-enough</link>
      <guid>http://jacksleight.com/blog/2008/01/16/ie7-dotted-borders-close-but-not-close-enough</guid>
      <description>&lt;p&gt;As if we needed any more IE bugs, I&amp;#8217;ve just run into another. Although dotted borders were &lt;a href="http://blogs.msdn.com/ie/archive/2006/08/22/712830.aspx"&gt;supposedly fixed in IE7&lt;/a&gt;, it turns out that they still render as dashed if applied to a fieldset with a legend (which is of course required). There&amp;#8217;s no workaround as far as I know.&lt;/p&gt;

	&lt;p&gt;&lt;a href="/assets/blog/ie7-dotted-borders-close-but-not-close-enough/example.html"&gt;Example&lt;/a&gt;&lt;/p&gt;&lt;img src="http://feeds.jacksleight.com/~r/JackSleightsBlog/~4/217855328" height="1" width="1"/&gt;</description>
      <pubDate>Wed, 16 Jan 2008 21:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[Three IE Form CSS Problems & Solutions]]></title>
      <link>http://jacksleight.com/blog/2008/01/15/three-ie-form-css-problems-and-solutions</link>
      <guid>http://jacksleight.com/blog/2008/01/15/three-ie-form-css-problems-and-solutions</guid>
      <description>&lt;p&gt;I was recently writing some &lt;span class="caps"&gt;CSS&lt;/span&gt; for a form, and ran into a few IE problems/bugs which I&amp;#8217;ve either not found documented elsewhere, or at least not found my solutions mentioned, so I figured it would be beneficial to post them here.&lt;/p&gt;

	&lt;h2&gt;Checkbox/Radio Margins&lt;/h2&gt;

	&lt;p&gt;I&amp;#8217;m not entirely sure if this should be considered a bug, or if it&amp;#8217;s simply the way IE likes to render these elements. As you may well know, checkboxes and radio buttons both appear to have a three or four pixel margin around them that can&amp;#8217;t be removed with &lt;code&gt;margin: 0;&lt;/code&gt;. There are a few suggestions around for fixing this, but I didn&amp;#8217;t find any using negative margins, which, as it happens, work perfectly:&lt;/p&gt;

&lt;pre class="sh_css"&gt;&lt;code class="sh_css"&gt;input.checkbox, 
input.radio	{ margin: -4px -3px -3px -4px; }
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;&lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/checkbox-radio-margins/problem.html"&gt;See Problem&lt;/a&gt;, &lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/checkbox-radio-margins/solution.html"&gt;See Solution&lt;/a&gt;&lt;/p&gt;

	&lt;p&gt;This problem exists in both IE6 and IE7, and this solution works in both.&lt;/p&gt;

	&lt;h3&gt;Considerations&lt;/h3&gt;

	&lt;p&gt;Remember that reducing the margins removes the clickable white-space around the element, potentially reducing usability.&lt;/p&gt;

	&lt;h2&gt;Legend Left Margin/Padding&lt;/h2&gt;

	&lt;p&gt;Again, bug or feature, who knows? By default you get a seven pixel gap on the left-hand side of the legend element&amp;#8217;s text (in addition to the fieldset&amp;#8217;s left padding), which can&amp;#8217;t be removed by setting the margin or padding to zero. However, it&amp;#8217;s negative margins to the rescue again:&lt;/p&gt;

&lt;pre class="sh_css"&gt;&lt;code class="sh_css"&gt;legend	{ margin-left: -7px; }
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;&lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/legend-left-margin-padding/problem.html"&gt;See Problem&lt;/a&gt;, &lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/legend-left-margin-padding/solution.html"&gt;See Solution&lt;/a&gt;&lt;/p&gt;

	&lt;p&gt;This problem exists in both IE6 and IE7, and this solution works in both.&lt;/p&gt;

	&lt;h2&gt;Fieldset &lt;code&gt;border-top: 0;&lt;/code&gt; and Legend &lt;code&gt;display: none;&lt;/code&gt;&lt;/h2&gt;

	&lt;p&gt;This is a fairly specific problem that most people will probably never run into, but I had a couple of fieldsets whose legends I wanted to hide. When you have a fieldset with borders but no top border, and the legend is set to &lt;code&gt;display: none;&lt;/code&gt;, you &lt;strong&gt;still get a top border&lt;/strong&gt;. Make the legend visible and the border goes away. The only way I found to fix this was to set the border colour to the same as the fieldset&amp;#8217;s background colour, which is in no way ideal.&lt;/p&gt;
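
	&lt;p&gt;A minimal sketch of the idea, assuming a white page background, a grey border and a hypothetical &lt;code&gt;hidden-legend&lt;/code&gt; class on the fieldset (the selector and colours are illustrative only, and the linked solution page may do it differently):&lt;/p&gt;

&lt;pre class="sh_css"&gt;&lt;code class="sh_css"&gt;/* Assumed: the fieldset sits on a white (#fff) background */
fieldset.hidden-legend		{ border: 1px solid #999; border-top: 0; }
fieldset.hidden-legend legend	{ display: none; }

/* The top border IE still draws is coloured to match the background */
fieldset.hidden-legend		{ border-top-color: #fff; }
&lt;/code&gt;&lt;/pre&gt;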

	&lt;p&gt;&lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/fieldset-border-top-0-legend-display-none/problem.html"&gt;See Problem&lt;/a&gt;, &lt;a href="/assets/blog/three-ie-form-css-problems-and-solutions/fieldset-border-top-0-legend-display-none/solution.html"&gt;See Solution&lt;/a&gt;&lt;/p&gt;

	&lt;p&gt;This problem exists in both IE6 and IE7, and this solution works in both.&lt;/p&gt;&lt;img src="http://feeds.jacksleight.com/~r/JackSleightsBlog/~4/217668222" height="1" width="1"/&gt;</description>
      <pubDate>Tue, 15 Jan 2008 11:00:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[Archive & Tags Added]]></title>
      <link>http://jacksleight.com/blog/2008/01/14/archive-and-tags-added</link>
      <guid>http://jacksleight.com/blog/2008/01/14/archive-and-tags-added</guid>
      <description>&lt;p&gt;So, it took me less than a day to realise I needed a better archiving/category system for the blog. So I&amp;#8217;ve added one. You can now view posts by tag (via the blog home page) or through the archive via the date links under each post title (or by just guessing the &lt;span class="caps"&gt;URL&lt;/span&gt; if you really want).&lt;/p&gt;&lt;img src="http://feeds.jacksleight.com/~r/JackSleightsBlog/~4/217668224" height="1" width="1"/&gt;</description>
      <pubDate>Mon, 14 Jan 2008 17:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
