Forum Replies Created

Viewing 15 replies - 1 through 15 (of 20 total)
  • Thread Starter Mark Tuttle

    (@markrtuttle)

    I see. That is not my experience, but I understand your explanation. It would ordinarily be a very minor issue, except that I import large sites subtree by subtree over extended periods of time as I and other volunteers hack the archaic html to meet modern standards before importing. Perhaps the most elegant solution is to insert a line into the documentation (even into the configuration page?) mentioning this distinction so the next guy like me (if there ever is one) is not surprised.

    I did not appreciate the consequence of this problem until I tried importing files this month using the tag ‘body’ to select the whole web page for importing. Now the body tag appears in post_content in the database, and now WordPress is generating invalid html with one ‘body’ tag nested inside the WordPress ‘body’ tag.

    Thread Starter Mark Tuttle

    (@markrtuttle)

    Looking at method get_post($path,$placeholder) in class HTML_Import in html-importer.php

    // if we're doing hierarchicals and this is an index file of a
    // subdirectory, instead of importing this as a separate page, update
    // the content of the placeholder page we created for the directory
    if (is_post_type_hierarchical($options['type']) &&
        dirname($path) != $options['root_directory'] &&
        basename($path) == $options['index_file']) {
    	$post_id = array_search(dirname($path), $this->filearr);
    	if ($post_id !== 0)
    		$updatepost = true;
    }
    
    if ($updatepost) {
    	$my_post['ID'] = $post_id;
    	wp_update_post( $my_post );
    }
    else // insert new post
    	$post_id = wp_insert_post($my_post);

    it seems that files in the root directory are not made children of the index file in the root directory, because no placeholder post gets made for the root directory, so there is no existing post to update with wp_update_post. Am I reading this correctly? So by design the hierarchy for the root directory must be constructed manually?

    I would try listing “class” as one of the allowed attributes under the “clean up html” section of the import tool.

    Mark Tuttle

    (@markrtuttle)

    What do you see when you enter http://www.site.com/?page_id=435 directly into your browser? Do you see the url rewritten to the permalink you are expecting?

    This sounds to me like a problem with the permalinks configuration and not with this plugin. Have you looked at http://codex.wordpress.org/Using_Permalinks? Is your web server configured as described there under “Permalink Types” (e.g., Apache with mod_rewrite loaded)? Have you tried other permalink settings like “Month and name” or “Post name” in place of the “Custom Structure” you seem to be using, just for debugging purposes?

    If you are not already comfortable with php scripting, and since your target is a single directory of files, it is probably faster and safer to bite the bullet and change the slugs manually one page at a time (dashboard -> pages -> all pages -> edit). A single evening of boring manual labor in front of the television will probably get the job done, and then it will be over.

    I suspect you misspecified the html element containing the content of the pages under dashboard->settings->html import. See another post where Stephanie addresses this in the context of the importer grabbing only the page titles.

    I’ve used the plugin with 3.3.1 (I saw a schedule suggesting 3.4 is coming out in April). What didn’t work?

    I’m not the plugin author, but

    1. I think selecting the entire node <div id="content">…</div> is a reasonable design decision. Can you say why this is a problem? Do you now have two document nodes with the same id "content"? Is this as simple as modifying your theme to omit the extra <div id="content"> </div> wrapper?

    2. For the duplicate title, I suspect this is not a problem with the plugin. I suspect the problem is that your static pages (as mine did) contain both <title>title string</title> and <h1>title string</h1> and most WordPress themes repeat the title at the top of the body with a line like

    <h1 class="entry-title"><?php the_title(); ?></h1>

    So one quick solution is just to delete this line from the theme files. It is also possible to write a small script to iterate over the pages in the database to strip the initial <h1>…</h1> element from $page['post_content'] for each $page in the database.
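
    A minimal sketch of such a script, assuming the duplicate heading is the first element in the content. The helper name strip_leading_h1 is hypothetical (it is not part of the plugin or of WordPress), and the commented loop using get_pages and wp_update_post is an untested assumption about how it might be wired in:

```php
<?php
// Hypothetical helper: remove one <h1>...</h1> element from the very start
// of a post's content, leaving everything else untouched.
function strip_leading_h1($content) {
    // ^\s*          optional leading whitespace
    // <h1\b[^>]*>   opening tag, with or without attributes
    // .*?           shortest possible heading body
    // </h1\s*>\s*   closing tag and any whitespace after it
    // i = case-insensitive, s = let "." match newlines
    return preg_replace('@^\s*<h1\b[^>]*>.*?</h1\s*>\s*@is', '', $content, 1);
}

// Assumed usage inside WordPress (untested sketch):
// foreach (get_pages() as $page) {
//     wp_update_post(array(
//         'ID'           => $page->ID,
//         'post_content' => strip_leading_h1($page->post_content),
//     ));
// }

echo strip_leading_h1("<h1>Title</h1>\n<p>Body</p>"), "\n";
```

    Content that does not begin with an <h1> element is returned unchanged, so running the loop twice should be harmless. Back up the database first in any case.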

    I’m not the plugin author, but a subsequent poster asked a similar question.

    I have manually patched slugs to match filenames by computing the mapping $id->$filename from post id to filename, and then writing a script to essentially

    $post = get_post($id, ARRAY_A);  // ARRAY_A returns the post as an array
    $post['post_name'] = $filename;  // the slug is stored in post_name
    wp_update_post($post);

    By the way, your request is reasonable, but I’m not certain it is always possible. There seems to be some requirement that slugs are unique (although I’ve gotten away with slugs that are not unique for reasons I can’t explain), so if you have two files with the same name FILENAME you might end up with posts having slugs FILENAME and FILENAME-2.

    I’m not the plugin author, but I believe the HTML_Import class in html-importer.php defines a method get_post that reads the file and creates the page. This method builds up a WordPress post object in the array $my_post, without explicitly specifying the slug, and then inserts the post into the database with the lines

    if ($updatepost) {
      $my_post['ID'] = $post_id;
      wp_update_post( $my_post );
    }
    else // insert new post
      $post_id = wp_insert_post($my_post);

    I believe these functions wp_update_post and wp_insert_post generate the slug from the title when the slug is not specified. So if you know how to compute the slug you want to use, then setting the slug to $slug should be as easy as adding

    $my_post['post_name'] = $slug;

    before the post insertion code.
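
    One way to compute such a slug is from the file name itself. The following is only a sketch under that assumption: slug_from_path is a hypothetical helper, not something the plugin provides, and WordPress may still adjust the value it is given (for example by appending -2 to keep slugs unique).

```php
<?php
// Hypothetical helper: derive a slug from the imported file's path,
// e.g. "/site/docs/About_Us.html" becomes "about-us".
function slug_from_path($path) {
    $name = pathinfo($path, PATHINFO_FILENAME);       // file name without extension
    $slug = strtolower($name);
    $slug = preg_replace('@[^a-z0-9]+@', '-', $slug); // other characters become "-"
    return trim($slug, '-');                          // no leading or trailing "-"
}

// Assumed placement, just before the insert/update code in get_post:
// $my_post['post_name'] = slug_from_path($path);

echo slug_from_path('/site/docs/About_Us.html'), "\n";
```

    Inside WordPress, sanitize_title could replace the hand-written lowercasing and regular expression.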

    I’m not the plugin author, but the only instance of fopen that I can find in the latest release is in html-importer.php:

    $contents = @fopen($path);  // read entire file
    if (empty($contents))
      $contents = @file_get_contents($path);

    I think the fopen call fails (fopen requires a mode as its second argument, and none is given), the at-sign suppresses the warning, and file_get_contents is what actually reads the file.

    Thread Starter Mark Tuttle

    (@markrtuttle)

    To strip the cdata, script, and style blocks, I think it is sufficient to add the functions

    function allowed_tag($tag,$allowedtags=NULL) {
      return
        !is_null($allowedtags) &&
        stripos($allowedtags,$tag) !== false;
    }
    
    function strip_cdata_block($string,$allowedtags=NULL) {
      if ($this->allowed_tag('<cdata>',$allowedtags)) return $string;
    
      $delim = "@";
      $cdata_start = preg_quote('<![CDATA[',$delim);
      $cdata_end = preg_quote(']]>',$delim);
      $block = "$cdata_start.*?$cdata_end";
    
      return preg_replace("{$delim}$block{$delim}s","",$string);
    }
    
    function strip_tag_block($tag,$string,$allowedtags=NULL) {
      if ($this->allowed_tag($tag,$allowedtags)) return $string;
      if (!preg_match(":<(.*?)>:",$tag,$match)) return $string;
    
      $delim = "@";
      $tag_str = $match[1];
      $tag_start = "<$tag_str(?:>|\\s[^>]*>)";
      $tag_end   = "</$tag_str(?:>|\\s[^>]*>)";
      $block = "$tag_start.*?$tag_end";
    
      return preg_replace("{$delim}$block{$delim}is","",$string);
    }
    
    function strip_comment_block($string) {
      $delim = "@";
      $comment_start = preg_quote('<!--',$delim);
      $comment_end = preg_quote('-->',$delim);
      $block = "$comment_start.*?$comment_end";
    
      return preg_replace("{$delim}$block{$delim}s","",$string);
    }

    and add the following calls before strip_tags at the head of clean_html:

    $string = $this->strip_cdata_block($string,$allowtags);
    $string = $this->strip_tag_block('<script>',$string,$allowtags);
    $string = $this->strip_tag_block('<style>',$string,$allowtags);
    $string = $this->strip_comment_block($string);

    Thread Starter Mark Tuttle

    (@markrtuttle)

    I propose adding to the HTML_Import class defined in html-importer.php the function

    function strip_insignificant_html_whitespace($string) {
      $pre_start = "<pre(?:>|\\s[^>]*>)";
      $pre_end   = "</pre(?:>|\\s[^>]*>)";
    
      $old_parts = preg_split(";($pre_start|$pre_end);i",$string,0,PREG_SPLIT_DELIM_CAPTURE);
      $new_parts = array();
    
      $strip = true;
      foreach ($old_parts as $part) {
        if (preg_match(";$pre_start;i",$part)) {
          $tmp = preg_replace(";\s+;"," ",$part);
          $new_parts[] = preg_replace("; +>;",">",$tmp);
          $strip = false;
          continue;
        }
        if (preg_match(";$pre_end;i",$part)) {
          $tmp = preg_replace(";\s+;"," ",$part);
          $new_parts[] = preg_replace("; +>;",">",$tmp);
          $strip = true;
          continue;
        }
        if ($strip)
          $new_parts[] = preg_replace(";\s+;"," ",$part);
        else
          $new_parts[] = $part;
      }
      return implode("",$new_parts);
    }

    In clean_html

    replace
      $string = str_replace( '\n', ' ', $string );
    with
      $string = $this->strip_insignificant_html_whitespace($string);

    In get_post, inside the !empty($my_post['post_content']) branch,

    replace
      $my_post['post_content'] = ereg_replace("[\n\r]", " ", $my_post['post_content']);
    with
      $my_post['post_content'] = $this->strip_insignificant_html_whitespace($my_post['post_content']);

    It would be nice also to strip the contents of cdata blocks and <script>..</script> blocks cleanly. I find examples like

    <div id="googleAds">
      <!-- b e g i n   g o o g l e  a d s  -->
      <script type="text/javascript">
        //<![CDATA[
        <!--
        google_ad_client = "...";
        google_ad_slot = "...";
        google_ad_width = ...;
        google_ad_height = ...;
        //-->
        //]]>
      </script>
      <script type="text/javascript" src="/data/../pagead2.googlesyndication.com/pagead/show_ads.js">
      </script> <!-- e n d   g o o g l e  a d s  -->
    </div>

    that are not stripped cleanly by the application of the php strip_tags function in the plugin.
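
    A reduced illustration of the problem, using a made-up one-line page rather than the ad block above: strip_tags removes the tags themselves but keeps the text between them, so the body of a <script> element is left behind as plain text.

```php
<?php
// Made-up input for illustration; not taken from any imported page.
$html = '<div><script type="text/javascript">var x = 1;</script><p>Hello</p></div>';

// strip_tags() drops the <div>, <script> and <p> tags but not the text
// they enclose, so the JavaScript source ends up in the result.
$stripped = strip_tags($html);

echo $stripped, "\n";
```

    The JavaScript statement remains in $stripped next to the paragraph text, which is why the script block has to be removed as a whole before strip_tags runs.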

    Thread Starter Mark Tuttle

    (@markrtuttle)

    What I had missed (forgotten) was that the installation of the TinyMCE Advanced plugin adds a configuration page to the dashboard. Go to Dashboard -> Settings -> TinyMCE Advanced. There I have the ability to drag and drop the new editing buttons I want into the rows of editing buttons in the visual editor. I saved that, and the buttons now appear in the visual editor. I had thought they would appear in the visual editor by default after installation. I had forgotten that I have to choose the buttons I want.
