<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Documentation on Marginalia Search Engine Software Documentation</title>
    <link>https://docs.marginalia.nu/</link>
    <description>Recent content in Documentation on Marginalia Search Engine Software Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 17 Jan 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://docs.marginalia.nu/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>1.1 Hardware and Configuration</title>
      <link>https://docs.marginalia.nu/1_overview/01_hardware/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/1_overview/01_hardware/</guid>
      <description>The Marginalia Search Engine is designed to be run on a single server. The server should be an x86-64 machine with at least 16GB of RAM and at least 4 cores. It is designed to run on physical hardware, and will likely be very expensive to run in the cloud.
Overall the system is designed to be run on a single server, but it is possible to run the index nodes on separate servers.</description>
    </item>
    
    <item>
      <title>1.2 Software Requirements</title>
      <link>https://docs.marginalia.nu/1_overview/02_software-reqs/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/1_overview/02_software-reqs/</guid>
      <description>The software requirements for running the Marginalia Search Engine are:
Linux; it probably doesn&amp;rsquo;t matter which distro, but the instructions assume something Debian or Ubuntu-like. If you&amp;rsquo;re running something else, especially something other than Linux, it may or may not work. Docker (install guides) Docker Compose (install guides) Java (use sdkman to install): JDK 22 for the latest released version of the system JDK 23 for the head of the git repository NPM </description>
    </item>
    
    <item>
      <title>1.3 Installing</title>
      <link>https://docs.marginalia.nu/1_overview/03_installing/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/1_overview/03_installing/</guid>
      <description>To install the search engine software, you need to clone the repository.
$ git clone https://github.com/MarginaliaSearch/MarginaliaSearch.git This will create a directory called MarginaliaSearch in your current directory. Change into that directory, and run the setup.sh script.
It will download a bunch of additional files, primarily from https://downloads.marginalia.nu. This is necessary as the search engine uses large binary model files for language processing, and these don&amp;rsquo;t agree well with git.
$ run/setup.</description>
    </item>
    
    <item>
      <title>2.1 New Crawl</title>
      <link>https://docs.marginalia.nu/2_crawling/1_new_crawl/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/1_new_crawl/</guid>
      <description>NOTE: Please be sure to read the crawling disclaimer before proceeding.
Bootstrapping the domain database While a running search engine can use the link database to figure out which websites to visit, a clean system does not know of any links, so you must add a few domains yourself. To do this, either follow the link in the New Crawl GUI, or use the top menu and select Domains-&amp;gt;Add Domains</description>
    </item>
    
    <item>
      <title>2.2 Recrawling</title>
      <link>https://docs.marginalia.nu/2_crawling/2_recrawl/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/2_recrawl/</guid>
      <description>The workflow with a crawl spec was a one-off process to bootstrap the search engine. To keep the search engine up to date, it is preferable to do a recrawl. This will try to reduce the amount of data that needs to be fetched.
To trigger a Recrawl, go to Nodes-&amp;gt;Node N-&amp;gt;Actions-&amp;gt;Re-crawl. This will bring you to a page that looks similar to the &amp;lsquo;new crawl page&amp;rsquo;, where you can select a set of existing crawl data to use as a source.</description>
    </item>
    
    <item>
      <title>2.3 Processing and Loading</title>
      <link>https://docs.marginalia.nu/2_crawling/3_loading/</link>
      <pubDate>Tue, 16 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/3_loading/</guid>
      <description>Once the crawl is done, the data needs to be processed before it&amp;rsquo;s searchable. This process extracts keywords and features from the documents, and converts them into a format that can be loaded into the search engine.
This is done by going to Nodes-&amp;gt;Node N-&amp;gt;Actions-&amp;gt;Process Crawl Data.
Process Crawl Data Dialog This will start the conversion process. This will again take a while, depending on the size of the crawl. The progress bar will show the progress.</description>
    </item>
    
    <item>
      <title>1.4 Configuration</title>
      <link>https://docs.marginalia.nu/1_overview/04_configuring/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/1_overview/04_configuring/</guid>
      <description>After installing, a directory structure will be created in the install directory. In it, the following files and directories will be created:
path description conf/properties java-style properties files for configuring the search engine conf/suggestions.txt A list of suggestions for the search interface conf/db.properties JDBC configuration env/mariadb.env Environment variables for the mariadb container env/service.env Environment variables for Marginalia Search services logs/ Log files model/ Language processing models index-1/backup Index backups for index node 1 index-1/index Index data for index node 1 index-1/storage Raw and processed crawl data for index node 1 index-1/work Temporary work directory for index node 1 index-1/uploads Upload directory for index node 1 index-2/backup Index backups for index node 2 index-2/index Index data for index node 2 &amp;hellip; &amp;hellip; For a production-like deployment, you will probably want to move the db and index directories to a separate storage device.</description>
    </item>
    
    <item>
      <title>1.5 System Overview</title>
      <link>https://docs.marginalia.nu/1_overview/05_system_overview/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/1_overview/05_system_overview/</guid>
      <description>The search engine consists of several components, each of which is run in a separate Docker container.
Index Nodes The system is designed to be able to run with multiple partitions. At least one partition is required, but more can be added. Each partition is called an Index Node. Each index node is a separate Docker container, and can be run on a separate server, or on the same server.</description>
    </item>
    
    <item>
      <title>2.4.1 - WARCs</title>
      <link>https://docs.marginalia.nu/2_crawling/4_sideloading/1_warc/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/4_sideloading/1_warc/</guid>
      <description>WARC files are the standard format for web archives. They can be created e.g. with wget. The Marginalia software can read WARC files directly, and sideload them into the index, as long as each WARC file contains only one domain.
Let&amp;rsquo;s for example archive www.marginalia.nu (I own this domain, so feel free to try this at home)
$ wget -r --warc-file=marginalia www.marginalia.nu Note If you intend to do this on other websites, you should probably add a --wait parameter to wget, e.</description>
    </item>
    
    <item>
      <title>2.4.2 - ZIM</title>
      <link>https://docs.marginalia.nu/2_crawling/4_sideloading/2_openzim/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/4_sideloading/2_openzim/</guid>
      <description>Wikipedia is the archetype of a website that is too large to crawl. Thankfully, they provide dumps of their data in a format called ZIM. This is a format that is optimized for offline use, and is used by the Kiwix project to provide offline access to Wikipedia. Wikipedia&amp;rsquo;s ZIM files are available for download at https://download.kiwix.org/zim/wikipedia/
Since the search engine doesn&amp;rsquo;t process images, we can use the smaller &amp;ldquo;no images&amp;rdquo; version of the dump, wikipedia_en_all_nopic_YYYY-MM.</description>
    </item>
    
    <item>
      <title>2.4.3 - Stackexchange</title>
      <link>https://docs.marginalia.nu/2_crawling/4_sideloading/3_stackexchange/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/4_sideloading/3_stackexchange/</guid>
      <description>The search engine is capable of side-loading stackexchange data dumps. These are available from https://archive.org/details/stackexchange. The data dumps are available in a compressed XML format.
It is probably a good idea to select the torrent option, as the files are quite large, and archive.org&amp;rsquo;s servers are not particularly fast. This will also allow you to limit the download to the sites you are interested in.
The system will digest the 7z files directly, so you don&amp;rsquo;t need to uncompress them.</description>
    </item>
    
    <item>
      <title>2.4.4 - Directory Tree</title>
      <link>https://docs.marginalia.nu/2_crawling/4_sideloading/4_dirtree/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/4_sideloading/4_dirtree/</guid>
      <description>For relatively small websites, ad-hoc side-loading is available directly from a folder structure on the hard drive. This is intended for loading manuals, documentation and similar data sets that are large and slowly changing.
A website can be archived with wget, like this
wget -nc -x --continue -w 1 -r -A &amp;#34;html&amp;#34; &amp;#34;docs.marginalia.nu&amp;#34; After doing this to a bunch of websites, create a YAML file in the upload directory, with contents something like this:</description>
    </item>
    
    <item>
      <title>2.5 - WARC export</title>
      <link>https://docs.marginalia.nu/2_crawling/5_warc_export/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/5_warc_export/</guid>
      <description>It is possible to configure the crawler to export its crawl data in a WARC format on top of the native parquet format. This is toggled in the node configuration, available from Index Nodes -&amp;gt; Node N -&amp;gt; Configuration
The node configuration panel, showing the `Keep WARC files during crawling` option If the option Keep WARC files during crawling is enabled, the crawler will retain a WARC record of the crawl.</description>
    </item>
    
    <item>
      <title>3.1 Node Configuration</title>
      <link>https://docs.marginalia.nu/3_configuration_options/1_node_configuration/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/3_configuration_options/1_node_configuration/</guid>
      <description>Under Nodes -&amp;gt; Node N -&amp;gt; Configuration, you will find a list of configuration options that can be set for each node.
Node Configuration Dialog Accept Queries This option toggles whether the query service will route queries to this node. This is useful if you want to take a node out of rotation for some reason.
Keep WARC files during crawling If this option is enabled, the WARC files will be compacted into common-crawl style indexed WARC files.</description>
    </item>
    
    <item>
      <title>3.2 Data Sets</title>
      <link>https://docs.marginalia.nu/3_configuration_options/2_data_sets/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/3_configuration_options/2_data_sets/</guid>
      <description>Under System -&amp;gt; Data Sets, you will find a couple of options. These define URLs that the system will use to download data sets from.
Data Sets Data Set URLs Blogs The blogs list is a list of domains that the system considers to be blogs. This affects how these domains are processed: paths like /tags or /category are ignored, and the system will operate on the assumption that the content is a blog post.</description>
    </item>
    
    <item>
      <title>3.3 Domain Ranking Sets</title>
      <link>https://docs.marginalia.nu/3_configuration_options/3_ranking_sets/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/3_configuration_options/3_ranking_sets/</guid>
      <description>Under System -&amp;gt; Domain Ranking Sets, you will find a list of domain ranking sets. These are configurations for the domain ranking system. The domain ranking system assigns a score to each domain, which affects the order in which the domains are considered in the index. Thus a high ranking means results from a domain are more likely to be returned in a query.
A few domain ranking sets are reserved, and cannot be deleted.</description>
    </item>
    
    <item>
      <title>4.1 Model Files</title>
      <link>https://docs.marginalia.nu/4_data/1_model_files/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/4_data/1_model_files/</guid>
      <description>In the model/ directory, the following files are stored:
File Description English.DICT RDRPosTagger dictionary English.RDR RDRPosTagger model lid.176.ftz fasttext language identification model opennlp-sentence.bin OpenNLP sentence detector model opennlp-tokens.bin OpenNLP tokenizer model tfreq-new-algo3.bin Marginalia term frequency model ngrams.bin Marginalia n-grams model The RDRPosTagger models are used for fast part-of-speech tagging. These and additional models are available at the RDRPOSTagger git repository
The fasttext language identification model is used to identify the language of a document.</description>
    </item>
    
    <item>
      <title>4.2 Data Files</title>
      <link>https://docs.marginalia.nu/4_data/2_data_files/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/4_data/2_data_files/</guid>
      <description>In the data/ directory, the following files are stored:
File Description adblock.txt Adblock rules asn-data-raw-table CIDR-&amp;gt;ASN data asn-used-autnums ASN-&amp;gt;AS registry data IP2LOCATION-LITE-DB1.CSV IP2Location data atags.parquet Anchor tags adblock.txt is used to detect ads and other problematic content in the crawled documents.
The asn-files are sourced from APNIC and used to map IP addresses to the corresponding CIDR and autonomous system.
The ip2location data is from https://lite.ip2location.com/, and available under CC-BY-SA 4.</description>
    </item>
    
    <item>
      <title>4.3 Crawl Data, Processed Data, etc.</title>
      <link>https://docs.marginalia.nu/4_data/3_crawl_data/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/4_data/3_crawl_data/</guid>
      <description>The system stores crawl data in index-n/storage/, along with processed data and various other long-term data.
The data generated by the system is in general viewable within the control GUI, under Index Nodes -&amp;gt; Node N -&amp;gt; Storage.
Listing of data Clicking on the paths in this view will bring up details.
Data details screenshot This view will show the path of the data relative to the node storage root (e.</description>
    </item>
    
    <item>
      <title>4.4 Backups</title>
      <link>https://docs.marginalia.nu/4_data/4_backups/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/4_data/4_backups/</guid>
      <description>The system automatically snapshots the index data before the index is constructed. This allows relatively quick rollbacks of the index if for some reason this operation needs to be undone. The index data is stored in the index-n/backup directory, where n is the index node number.
Backup restoration is done from the control interface, under Node N-&amp;gt;Actions-&amp;gt;Restore Backup. This will restore the backup and the index will be rebuilt from there.</description>
    </item>
    
    <item>
      <title>5. Sample Crawl Data</title>
      <link>https://docs.marginalia.nu/2_crawling/4_sideloading/5_sample_data/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/2_crawling/4_sideloading/5_sample_data/</guid>
      <description>It is possible to download sample crawl data from the Marginalia Search project. This is useful for quickly setting up a test environment for experimentation or assessment of changes to the code.
Caveat If you load sample data into the system, these domains will also be included in future re-crawls. It is a good practice to segregate test environments with sample data from real environments, to avoid contamination of the domains table.</description>
    </item>
    
    <item>
      <title>Migration, 2024-03&#43;</title>
      <link>https://docs.marginalia.nu/6_notes/6_1__migrate_2024_03_plus/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://docs.marginalia.nu/6_notes/6_1__migrate_2024_03_plus/</guid>
      <description>After the end of February 2024, the project uses ZooKeeper for service discovery and has migrated to a new Docker build system. This is a fairly large change, and requires a few manual migration steps to keep using an existing installation.
Easy way: Do a clean install somewhere and copy the docker-compose.yml and env/service.env to your existing install.
Hard way: Add a zookeeper service to the docker-compose file. services: ... zookeeper: image: zookeeper container_name: &amp;#34;zookeeper&amp;#34; restart: always ports: - &amp;#34;127.</description>
    </item>
    
  </channel>
</rss>
