Doomsday clock parsing and plotting

Introduction

The Doomsday Clock is a symbolic timepiece maintained by the Bulletin of the Atomic Scientists (BAS) since 1947. It represents how close humanity is perceived to be to global catastrophe, primarily nuclear war but also including climate change and biological threats. The clock’s hands are set annually to reflect the current state of global security; midnight signifies theoretical doomsday.

In this notebook we consider two tasks:

  • Parsing of Doomsday Clock reading statements
  • Evolution of Doomsday Clock times
    • We extract relevant Doomsday Clock timeline data from the corresponding Wikipedia page.
      • (Instead of using a page from BAS.)
    • We show how timeline data from that Wikipedia page can be processed with “standard” Wolfram Language (WL) functions and with LLMs.
    • The resulting plot shows the evolution of the minutes to midnight.
      • The plot could show trends, highlighting significant global events that influenced the clock setting.
      • Hence, we put in informative callouts and tooltips.

The data extraction and visualization in the notebook serve educational purposes or provide insights into historical trends of global threats as perceived by experts. We try to make the ingestion and processing code universal and robust, suitable for multiple evaluations now or in the (near) future.

Remark: Keep in mind that the Doomsday Clock is a metaphor: its settings are not just data points but reflections of complex global dynamics, as assessed by certain experts and a board of sponsors.

Remark: Currently (2024-12-30) the Doomsday Clock is set at 90 seconds before midnight.

Data ingestion

Here we ingest the Doomsday Clock timeline page and show corresponding statistics:

url = "https://thebulletin.org/doomsday-clock/timeline/";
txtEN = Import[url, "Plaintext"];
TextStats[txtEN]

(*<|"Characters" -> 77662, "Words" -> 11731, "Lines" -> 1119|>*)

By observing the (plain) text of that page we see the Doomsday Clock time setting can be extracted from the sentence(s) that begin with the following phrase:

startPhrase = "Bulletin of the Atomic Scientists";
sentence = Select[Map[StringTrim, StringSplit[txtEN, "\n"]], StringStartsQ[#, startPhrase] &] // First

(*"Bulletin of the Atomic Scientists, with a clock reading 90 seconds to midnight"*)

Grammar and parsers

Here is a grammar in Extended Backus-Naur Form (EBNF) for parsing Doomsday Clock statements:

ebnf = "
<TOP> = <clock-reading>  ;
<clock-reading> = <opening> , ( <minutes> | [ <minutes> , [ 'and' | ',' ] ] , <seconds> ) , 'to' , 'midnight' ;
<opening> = [ { <any> } ] , 'clock' , [ 'is' ] , 'reading' ; 
<any> = '_String' ;
<minutes> = <integer> <& ( 'minute' | 'minutes' )  <@ \"Minutes\"->#&;
<seconds> = <integer> <& ( 'second' | 'seconds' ) <@ \"Seconds\"->#&;
<integer> = '_?IntegerQ' ;";

Remark: The EBNF grammar above can be obtained with LLMs using a suitable prompt with example sentences. (We do not discuss that approach further in this notebook.)

Here the parsing functions are generated from the EBNF string above:

ClearAll["p*"]
res = GenerateParsersFromEBNF[ParseToEBNFTokens[ebnf]];
res // LeafCount

(*375*)

We must redefine the parser pANY (corresponding to the EBNF rule “<any>”) in order to prevent pANY from gobbling the word “clock” and thereby making the parser pOPENING fail.

pANY = ParsePredicate[StringQ[#] && # != "clock" &];

Here are random sentences generated with the grammar:

SeedRandom[32];
GrammarRandomSentences[GrammarNormalize[ebnf], 6] // Sort // ColumnForm

54jfnd 9y2f clock is reading 46 second to midnight
clock is reading 900 minutes to midnight
clock is reading 955 second to midnight
clock reading 224 minute to midnight
clock reading 410 minute to midnight
jdsf5at clock reading 488 seconds to midnight

Verifications of the (sub-)parsers:

pSECONDS[{"90", "seconds"}]

(*{{{}, "Seconds" -> 90}}*)

pOPENING[ToTokens@"That doomsday clock is reading"]

(*{{{}, {{"That", "doomsday"}, {"clock", {"is", "reading"}}}}}*)

Here the “top” parser is applied:

str = "the doomsday clock is reading 90 seconds to midnight";
pTOP[ToTokens@str]

(*{{{}, {{{"the", "doomsday"}, {"clock", {"is", "reading"}}}, {{{}, "Seconds" -> 90}, {"to", "midnight"}}}}}*)

Here the sentence extracted above is parsed and interpreted into an association with keys “Minutes” and “Seconds”:

aDoomReading = Association@Cases[Flatten[pTOP[ToTokens@sentence]], _Rule]

(*<|"Seconds" -> 90|>*)

Plotting the clock

Using the interpretation derived above, here we make a date list suitable for ClockGauge:

clockShow = DatePlus[{0, 0, 0, 12, 0, 0}, {-(Lookup[aDoomReading, "Minutes", 0]*60 + aDoomReading["Seconds"]), "Seconds"}]

(*{-2, 11, 30, 11, 58, 30}*)

Given that list, plotting a Doomsday Clock image (or gauge) is trivial:

ClockGauge[clockShow, GaugeLabels -> Automatic]
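
The time part of the date list computed with DatePlus can be cross-checked with plain integer arithmetic. Here is a minimal sketch (the names secsToMidnight and secsOnDial are introduced only for this illustration): the reading is subtracted from 12:00:00 expressed in seconds, and the result is split into hours, minutes, and seconds.

secsToMidnight = 90;
(* position on a 12-hour dial, in seconds counted from 00:00:00 *)
secsOnDial = 12*3600 - secsToMidnight;
(* split into {hours, minutes, seconds} *)
{Quotient[secsOnDial, 3600], Quotient[Mod[secsOnDial, 3600], 60], Mod[secsOnDial, 60]}

(*{11, 58, 30}*)

These values agree with the last three elements of the list given to ClockGauge.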

Let us define a function that makes the clock-gauge plot for a given association.

Clear[DoomsdayClockGauge];
Options[DoomsdayClockGauge] = Options[ClockGauge];
DoomsdayClockGauge[m_Integer, s_Integer, opts : OptionsPattern[]] := DoomsdayClockGauge[<|"Minutes" -> m, "Seconds" -> s|>, opts];
DoomsdayClockGauge[a_Association, opts : OptionsPattern[]] :=
  Block[{clockShow},
   clockShow = DatePlus[{0, 0, 0, 12, 0, 0}, {-(Lookup[a, "Minutes", 0]*60 + Lookup[a, "Seconds", 0]), "Seconds"}];
   ClockGauge[clockShow, opts, GaugeLabels -> Placed[Style["Doomsday\nclock", RGBColor[0.7529411764705882, 0.7529411764705882, 0.7529411764705882], FontFamily -> "Krungthep"], Bottom]]
   ];

Here are examples:

Row[{
   DoomsdayClockGauge[17, 0], 
   DoomsdayClockGauge[1, 40, GaugeLabels -> Automatic, PlotTheme -> "Scientific"], 
   DoomsdayClockGauge[aDoomReading, PlotTheme -> "Marketing"] 
  }]

More robust parsing

More robust parsing of Doomsday Clock statements can be obtained in these three ways:

  • “Fuzzy” match of words
    • For misspellings like “doomsdat” instead of “doomsday.”
  • Parsing of numeric word forms.
    • For statements like “two minutes and twenty five seconds.”
  • Delegating the parsing to LLMs when grammar parsing fails.

Fuzzy matching

The parser ParseFuzzySymbol can be used to handle misspellings (via EditDistance):

pDD = ParseFuzzySymbol["doomsday", 2];
lsPhrases = {"doomsdat", "doomsday", "dumzday"};
ParsingTestTable[pDD, lsPhrases]
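
Since the fuzzy matching is based on edit distances, it can help to look at those distances directly. The following small sketch uses only the built-in EditDistance and the same three words (the name lsWords is introduced here for illustration):

lsWords = {"doomsdat", "doomsday", "dumzday"};
(* edit distance of each candidate word to the target word *)
AssociationMap[EditDistance[#, "doomsday"] &, lsWords]

(*<|"doomsdat" -> 1, "doomsday" -> 0, "dumzday" -> 3|>*)

The word “dumzday” is three edits away from “doomsday”, which is beyond the threshold 2 given to ParseFuzzySymbol above.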

In order to include misspelling handling in the grammar, we rewrite the parsers manually. (The grammar is small, so this is not hard to do.)

pANY = ParsePredicate[StringQ[#] && EditDistance[#, "clock"] > 1 &];
pOPENING = ParseOption[ParseMany[pANY]]⊗ParseFuzzySymbol["clock", 1]⊗ParseOption[ParseSymbol["is"]]⊗ParseFuzzySymbol["reading", 2];
pMINUTES = "Minutes" -> # &⊙(pINTEGER ◁ ParseFuzzySymbol["minutes", 3]);
pSECONDS = "Seconds" -> # &⊙(pINTEGER ◁ ParseFuzzySymbol["seconds", 3]);
pCLOCKREADING = Cases[#, _Rule, Infinity] &⊙(pOPENING⊗(pMINUTES⊕ParseOption[pMINUTES⊗ParseOption[ParseSymbol["and"]⊕ParseSymbol["&"]⊕ParseSymbol[","]]]⊗pSECONDS)⊗ParseSymbol["to"]⊗ParseFuzzySymbol["midnight", 2]);

Here is a verification table with correct and incorrect spellings:

lsPhrases = {
    "doomsday clock is reading 2 seconds to midnight", 
    "dooms day cloc is readding 2 minute and 22 sekonds to mildnight"};
ParsingTestTable[pCLOCKREADING, lsPhrases, "Layout" -> "Vertical"]

Parsing of numeric word forms

One way to make the parsing more robust is to implement the ability to parse integer names (or numeric word forms), not just integers.

Remark: For a fuller discussion — and code — of numeric word forms parsing see the tech note “Integer names parsing” of the paclet “FunctionalParsers”, [AAp1].

First, we make an association that connects integer names with the corresponding integer values:

aWordedValues = Association[IntegerName[#, "Words"] -> # & /@ Range[0, 100]];
aWordedValues = KeyMap[StringRiffle[StringSplit[#, RegularExpression["\\W"]], " "] &, aWordedValues];
Length[aWordedValues]

(*101*)

Here is what the rules look like:

aWordedValues[[1 ;; -1 ;; 20]]

(*<|"zero" -> 0, "twenty" -> 20, "forty" -> 40, "sixty" -> 60, "eighty" -> 80, "one hundred" -> 100|>*)

Here we program the integer names parser:

pUpTo10 = ParseChoice @@ Map[ParseSymbol[IntegerName[#, {"English", "Words"}]] &, Range[0, 9]];
p10s = ParseChoice @@ Map[ParseSymbol[IntegerName[#, {"English", "Words"}]] &, Range[10, 100, 10]];
pWordedInteger = ParseApply[aWordedValues[StringRiffle[Flatten@{#}, " "]] &, p10s\[CircleTimes]pUpTo10\[CirclePlus]p10s\[CirclePlus]pUpTo10];

Here is a verification table of that parser:

lsPhrases = {"three", "fifty seven", "thirti one"};
ParsingTestTable[pWordedInteger, lsPhrases]

There are two parsing results for “fifty seven”, because pWordedInteger is defined with p10s⊗pUpTo10⊕p10s… . This can be remedied by using ParseJust or ParseShortest:

lsPhrases = {"three", "fifty seven", "thirti one"};
ParsingTestTable[ParseJust@pWordedInteger, lsPhrases]

Let us change pINTEGER to parse both integers and integer names:

pINTEGER = (ToExpression\[CircleDot]ParsePredicate[StringMatchQ[#, NumberString] &])\[CirclePlus]pWordedInteger;
lsPhrases = {"12", "3", "three", "forty five"};
ParsingTestTable[pINTEGER, lsPhrases]

Let us try the new parser using integer names for the clock time:

str = "the doomsday clock is reading two minutes and forty five seconds to midnight";
pTOP[ToTokens@str]

(*{{{}, {"Minutes" -> 2, "Seconds" -> 45}}}*)

Enhance with LLM parsing

There are multiple ways to employ LLMs for extracting “clock readings” from arbitrary statements for Doomsday Clock readings, readouts, and measures. Here we use LLM few-shot training:

flop = LLMExampleFunction[{
    "the doomsday clock is reading two minutes and forty five seconds to midnight" -> "{\"Minutes\":2, \"Seconds\": 45}", 
    "the clock of the doomsday gives 92 seconds to midnight" -> "{\"Minutes\":0, \"Seconds\": 92}", 
    "The bulletin atomic scienist maybe is set to a minute an 3 seconds." -> "{\"Minutes\":1, \"Seconds\": 3}" 
   }, "JSON"]

Here is an example invocation:

flop["Maybe the doomsday watch is at 23:58:03"]

(*{"Minutes" -> 1, "Seconds" -> 57}*)

The following function combines the parsing with the grammar and the LLM example function — the latter is used for fallback parsing:

Clear[GetClockReading];
GetClockReading[st_String] := 
   Block[{op}, 
    op = ParseJust[pTOP][ToTokens[st]]; 
    Association@
     If[Length[op] > 0 && op[[1, 1]] === {}, 
      (* grammar parsing succeeded: collect the "Minutes" and "Seconds" rules at any depth *)
      Cases[op, _Rule, Infinity], 
     (*ELSE*) 
      flop[st] 
     ] 
   ];
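
The control flow of such a combination can be tried without LLM access. Below is a minimal sketch under stated assumptions: primaryParse is a hypothetical regex-based stand-in for the grammar parser, and fallbackParse is a hypothetical stub that only marks the place where the LLM function would be called. Neither is part of the paclet “FunctionalParsers”.

Clear[primaryParse, fallbackParse, robustParse];
(* hypothetical primary parser: succeeds only on "<n> second(s) to midnight" *)
primaryParse[s_String] := 
  With[{m = StringCases[s, RegularExpression["(\\d+) seconds? to midnight"] -> "$1"]}, 
   If[m === {}, $Failed, <|"Seconds" -> FromDigits[First[m]]|>]];
(* hypothetical fallback: a stub standing in for an LLM call *)
fallbackParse[s_String] := <|"Parser" -> "fallback"|>;
(* try the cheap deterministic parser first, delegate only on failure *)
robustParse[s_String] := With[{r = primaryParse[s]}, If[r === $Failed, fallbackParse[s], r]];

{robustParse["the clock is reading 90 seconds to midnight"], robustParse["a minute and a half before doom"]}

(*{<|"Seconds" -> 90|>, <|"Parser" -> "fallback"|>}*)

The deterministic parser handles the inputs it recognizes quickly and reproducibly, and the slower, less predictable fallback is consulted only for the rest.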

Robust parser demo

Here is the application of the combined function above to a certain “random” Doomsday Clock statement:

s = "You know, sort of, that dooms-day watch is 1 and half minute be... before the big boom. (Of doom...)";
GetClockReading[s]

(*<|"Minutes" -> 1, "Seconds" -> 30|>*)

Remark: The same type of robust grammar-and-LLM combination is explained in more detail in the video “Robust LLM pipelines (Mathematica, Python, Raku)”, [AAv1]. (See, also, the corresponding notebook [AAn1].)

Timeline

In this section we extract Doomsday Clock timeline data and make a corresponding plot.

Parsing page data

Instead of using the official Doomsday Clock timeline page we use Wikipedia:

url = "https://en.wikipedia.org/wiki/Doomsday_Clock";
data = Import[url, "Data"];

Get timeline table:

tbl = Cases[data, {"Timeline of the Doomsday Clock [ 13 ] ", x__} :> x, Infinity] // First;

Show table’s columns:

First[tbl]

(*{"Year", "Minutes to midnight", "Time ( 24-h )", "Change (minutes)", "Reason", "Clock"}*)

Make a dataset:

dsTbl = Dataset[Rest[tbl]][All, AssociationThread[{"Year", "MinutesToMidnight", "Time", "Change", "Reason"}, #] &];
dsTbl = dsTbl[All, Append[#, "Date" -> DateObject[{#Year, 7, 1}]] &];
dsTbl[[1 ;; 4]]

Here is an association used to retrieve the descriptions from the date objects:

aDateToDescr = Normal@dsTbl[Association, #Date -> BreakStringIntoLines[#Reason] &];

Using LLM-extraction instead

Alternatively, we can extract the Doomsday Clock timeline using LLMs. Here we get the plaintext of the Wikipedia page and show statistics:

txtWk = Import[url, "Plaintext"];
TextStats[txtWk]

(*<|"Characters" -> 43623, "Words" -> 6431, "Lines" -> 315|>*)

Here we get the Doomsday Clock timeline table from that page in JSON format using an LLM:

res = 
  LLMSynthesize[{
    "Give the time table of the doomsday clock as a time series that is a JSON array.", 
    "Each element of the array is a dictionary with keys 'Year', 'MinutesToMidnight', 'Time', 'Description'.", 
    txtWk, 
    LLMPrompt["NothingElse"]["JSON"] 
   }, 
   LLMEvaluator -> LLMConfiguration[<|"Provider" -> "OpenAI", "Model" -> "gpt-4o", "Temperature" -> 0.4, "MaxTokens" -> 5096|>] 
  ]

(*"```json[{\"Year\": 1947, \"MinutesToMidnight\": 7, \"Time\": \"23:53\", \"Description\": \"The initial setting of the Doomsday Clock.\"},{\"Year\": 1949, \"MinutesToMidnight\": 3, \"Time\": \"23:57\", \"Description\": \"The Soviet Union tests its first atomic bomb, officially starting the nuclear arms race.\"}, ... *)

Post-process the LLM result:

res2 = ToString[res, CharacterEncoding -> "UTF-8"];
res3 = StringReplace[res2, {"```json", "```"} -> ""];
res4 = ImportString[res3, "JSON"];
res4[[1 ;; 3]]

(*{{"Year" -> 1947, "MinutesToMidnight" -> 7, "Time" -> "23:53", "Description" -> "The initial setting of the Doomsday Clock."}, {"Year" -> 1949, "MinutesToMidnight" -> 3, "Time" -> "23:57", "Description" -> "The Soviet Union tests its first atomic bomb, officially starting the nuclear arms race."}, {"Year" -> 1953, "MinutesToMidnight" -> 2, "Time" -> "23:58", "Description" -> "The United States and the Soviet Union test thermonuclear devices, marking the closest approach to midnight until 2020."}}*)
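
The post-processing above assumes that the LLM wraps its answer in JSON code fences. A slightly more tolerant sketch, assuming only that the response contains a single top-level JSON array, is to cut out the text between the first “[” and the last “]” and import it as "RawJSON", which gives associations directly. The string llmText below is a made-up, shortened response used only for illustration.

llmText = "Here is the table:\n[{\"Year\": 1947, \"MinutesToMidnight\": 7}, {\"Year\": 1949, \"MinutesToMidnight\": 3}]\nLet me know if you need more.";
(* (?s) lets "." match newlines; the greedy match spans from the first "[" to the last "]" *)
jsonText = First[StringCases[llmText, RegularExpression["(?s)\\[.*\\]"]], "[]"];
ImportString[jsonText, "RawJSON"]

(*{<|"Year" -> 1947, "MinutesToMidnight" -> 7|>, <|"Year" -> 1949, "MinutesToMidnight" -> 3|>}*)

If no array is found, the default "[]" makes the import return an empty list instead of failing.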

Make a dataset with the additional column “Date” (having date-objects):

dsDoomsdayTimes = Dataset[Association /@ res4];
dsDoomsdayTimes = dsDoomsdayTimes[All, Append[#, "Date" -> DateObject[{#Year, 7, 1}]] &];
dsDoomsdayTimes[[1 ;; 4]]

Here is an association that is used to retrieve the descriptions from the date objects:

aDateToDescr2 = Normal@dsDoomsdayTimes[Association, #Date -> #Description &];

Remark: The LLM-derived descriptions above are shorter than the descriptions in the column “Reason” of the dataset obtained by parsing the page data. For the plot tooltips below we use the latter.

Timeline plot

In order to have an informative Doomsday Clock evolution plot, we take the dataset’s time series and partition it into step-function pairs:

ts0 = Normal@dsDoomsdayTimes[All, {#Date, #MinutesToMidnight} &];
ts2 = Append[Flatten[MapThread[Thread[{#1, #2}] &, {Partition[ts0[[All, 1]], 2, 1], Most@ts0[[All, 2]]}], 1], ts0[[-1]]];
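
To see what the step-function construction does, here is the same expression applied to a toy series. (The name toy and the use of plain years instead of date objects are simplifications made for this illustration.)

toy = {{1947, 7}, {1949, 3}, {1953, 2}};
(* pair each value with both its own date and the next date, then append the last point *)
Append[Flatten[MapThread[Thread[{#1, #2}] &, {Partition[toy[[All, 1]], 2, 1], Most@toy[[All, 2]]}], 1], toy[[-1]]]

(*{{1947, 7}, {1949, 7}, {1949, 3}, {1953, 3}, {1953, 2}}*)

Each value is held until the next date, which is what makes the plotted line a staircase.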

Here are the corresponding labels indicating the year and the minutes (and seconds) before midnight:

lbls = Map[Row[{#Year, Spacer[3], "\n", IntegerPart[#MinutesToMidnight], Spacer[2], "m", Spacer[2], Round[FractionalPart[#MinutesToMidnight]*60], Spacer[2], "s"}] &, Normal@dsDoomsdayTimes];
lbls = Map[If[#[[1, -3]] == 0, Row@Take[#[[1]], 6], #] &, lbls];

Here the points “known” by the original time series are given callouts:

aRules = Association@MapThread[#1 -> Callout[Tooltip[#1, aDateToDescr[#1[[1]]]], #2] &, {ts0, lbls}];
ts3 = Lookup[aRules, Key[#], #] & /@ ts2;

Finally, here is the plot:

DateListPlot[ts3, 
  PlotStyle -> Directive[{Thickness[0.007`], Orange}],
  Epilog -> {PointSize[0.01`], Black, Point[ts0]}, 
  PlotLabel -> Row[(Style[#1, FontSize -> 16, FontColor -> Black, FontFamily -> "Verdana"] &) /@ {"Doomsday clock: minutes to midnight,", Spacer[3], StringRiffle[MinMax[Normal[dsDoomsdayTimes[All, "Year"]]], "-"]}], 
  FrameLabel -> {"Year", "Minutes to midnight"}, 
  Background -> GrayLevel[0.94`], Frame -> True, 
  FrameTicks -> {{Automatic, (If[#1 == 0, {0, Style["00:00", Red]}, {#1, Row[{"23:", 60 - #1}]}] &) /@ Range[0, 17]}, {Automatic, Automatic}}, GridLines -> {None, All},
  AspectRatio -> 1/3, ImageSize -> 1200
]

Remark: By hovering with the mouse over the black points the corresponding descriptions can be seen. We considered using clock-gauges as tooltips, but showing clock-settings reasons is more informative.

Remark: The plot was intentionally made to resemble the timeline plot in Doomsday Clock’s Wikipedia page.

Conclusion

As expected, parsing, plotting, or otherwise processing the Doomsday Clock settings and statements are excellent didactic subjects for textual analysis (or parsing) and temporal data visualization. The visualization could serve educational purposes or provide insights into historical trends of global threats as perceived by experts. (Remember, the clock’s settings are not just data points but reflections of complex global dynamics.)

One possible application of the code in this notebook is to make a “web service” that gives clock images with Doomsday Clock readings.

Setup

Needs["AntonAntonov`FunctionalParsers`"]

Clear[TextStats];
TextStats[s_String] := AssociationThread[{"Characters", "Words", "Lines"}, Through[{StringLength, Length@*TextWords, Length@StringSplit[#, "\n"] &}[s]]];

BreakStringIntoLines[str_String, maxLength_Integer : 60] := Module[
    {words, lines, currentLine}, 
    words = StringSplit[StringReplace[str, RegularExpression["\\v+"] -> " "]]; 
    lines = {}; 
    currentLine = ""; 
    Do[
       If[StringLength[currentLine] + StringLength[word] + 1 <= maxLength, 
          currentLine = StringJoin[currentLine, If[currentLine === "", "", " "], word], 
          AppendTo[lines, currentLine]; 
          currentLine = word; 
        ], 
       {word, words} 
     ]; 
    AppendTo[lines, currentLine]; 
    StringJoin[Riffle[lines, "\n"]] 
  ]

References

Articles, notebooks

[AAn1] Anton Antonov, “Making robust LLM computational pipelines from software engineering perspective”, (2024), Wolfram Community.

Paclets

[AAp1] Anton Antonov, “FunctionalParsers”, (2023), Wolfram Language Paclet Repository.

Videos

[AAv1] Anton Antonov, “Robust LLM pipelines (Mathematica, Python, Raku)”, (2024), YouTube/@AAA4prediction.

Robust LLM pipelines

… or “Making Robust LLM Computational Pipelines from Software Engineering Perspective”

Abstract

Large Language Models (LLMs) are powerful tools with diverse capabilities, but from a Software Engineering (SE) point of view (POV) they are unpredictable and slow. In this presentation we consider six ways to make more robust SE pipelines that include LLMs. We also consider a general methodological workflow for utilizing LLMs in “everyday practice.”

Here are the six approaches we consider:

  1. DSL for configuration-execution-conversion
    • Infrastructural, language-design level solution
  2. Detailed, well crafted prompts
    • AKA “Prompt engineering”
  3. Few-shot training with examples
  4. Via a Question Answering System (QAS) and code templates
  5. Grammar-LLM chain of responsibility
  6. Testing with data types and shapes over multiple LLM results

Compared to constructing SE pipelines, Literate Programming (LP) offers a dual or alternative way to use LLMs. For that it needs support and facilitation of:

  • Convenient LLM interaction (or chatting)
  • Document execution (weaving and tangling)

The discussed LLM workflows methodology is supported in Python, Raku, Wolfram Language (WL). The support in R is done via Python (with “reticulate”, [TKp1].)

The presentation includes multiple examples and showcases.

Modeling of the LLM utilization process is hinted at but not discussed.

Here is a mind-map of the presentation:

Here are the notebooks used in the presentation:


General structure of LLM-based workflows

All systematic approaches to unfolding and refining workflows based on LLM functions will include several decision points and iterations to ensure satisfactory results.

This flowchart outlines such a systematic approach:


References

Articles, blog posts

[AA1] Anton Antonov, “Workflows with LLM functions”, (2023), RakuForPrediction at WordPress.

Notebooks

[AAn1] Anton Antonov, “Workflows with LLM functions (in Raku)”, (2023), Wolfram Community.

[AAn2] Anton Antonov, “Workflows with LLM functions (in Python)”, (2023), Wolfram Community.

[AAn3] Anton Antonov, “Workflows with LLM functions (in WL)”, (2023), Wolfram Community.

Packages

Raku

[AAp1] Anton Antonov, LLM::Functions Raku package, (2023-2024), GitHub/antononcube. (raku.land)

[AAp2] Anton Antonov, LLM::Prompts Raku package, (2023-2024), GitHub/antononcube. (raku.land)

[AAp3] Anton Antonov, Jupyter::Chatbook Raku package, (2023-2024), GitHub/antononcube. (raku.land)

Python

[AAp4] Anton Antonov, LLMFunctionObjects Python package, (2023-2024), PyPI.org/antononcube.

[AAp5] Anton Antonov, LLMPrompts Python package, (2023-2024), GitHub/antononcube.

[AAp6] Anton Antonov, JupyterChatbook Python package, (2023-2024), GitHub/antononcube.

[MWp1] Marc Wouts, jupytext Python package, (2021-2024), GitHub/mwouts.

R

[TKp1] Tomasz Kalinowski, Kevin Ushey, JJ Allaire, RStudio, Yuan Tang, reticulate R package, (2016-2024)

Videos

[AAv1] Anton Antonov, “Robust LLM pipelines (Mathematica, Python, Raku)”, (2024), YouTube/@AAA4Predictions.

[AAv2] Anton Antonov, “Integrating Large Language Models with Raku”, (2023), The Raku Conference 2023 at YouTube.

Age at creation for programming languages stats

Introduction

In this blog post (notebook) we ingest programming language creation data from the “Programming Language DataBase” and visualize several of its statistics.

We do not examine the data source and we do not want to reason too much about the data using the stats. We started this notebook by just wanting to make the bubble charts (both 2D and 3D.) Nevertheless, we are tempted to say and justify statements like:

  • Pareto holds, as usual.
  • Language creators tend to do it more than once.
  • Beware the Second system effect.

References

Here are reference links with explanations and links to dataset files:


Data ingestion

Here we get the TSV file with the Wolfram Function Repository (WFR) function ImportCSVToDataset:

url = "https://pldb.io/posts/age.tsv";
dsData = ResourceFunction["ImportCSVToDataset"][url, "Dataset", "FieldSeparators" -> "\t"];
dsData[[1 ;; 4]]

Here we summarize the data using the WFR function RecordsSummary:

ResourceFunction["RecordsSummary"][dsData, "MaxTallies" -> 12]

Here is a list of languages we use to “get orientated” in the plots below:

lsFocusLangs = {"C++", "Fortran", "Java", "Mathematica", "Perl 6", "Raku", "SQL", "Wolfram Language"};

Here we find the most important tags (used in the plots below):

lsTopTags = ReverseSortBy[Tally[Normal@dsData[All, "tags"]], Last][[1 ;; 7, 1]]

(*{"pl", "textMarkup", "dataNotation", "grammarLanguage", "queryLanguage", "stylesheetLanguage", "protocol"}*)

Here we add the column “group” based on the focus languages and most important tags:

dsData = dsData[All, Append[#, "group" -> Which[MemberQ[lsFocusLangs, #id], "focus", MemberQ[lsTopTags, #tags], #tags, True, "other"]] &];

Distributions

Here are the distributions of the variables/columns:

  • “age at creation”
    • i.e. “How old was the creator?”
  • “appeared”
    • i.e. “In what year was the programming language announced?”

Association @ Map[# -> Histogram[Normal@dsData[All, #], 20, "Probability", Sequence[ImageSize -> Medium, PlotTheme -> "Detailed"]] &, {"ageAtCreation", "appeared"}]

Here are corresponding Box-Whisker plots together with tables of their statistics:

aBWCs = Association@
Map[# -> BoxWhiskerChart[Normal@dsData[All, #], "Outliers", Sequence[BarOrigin -> Left, ImageSize -> Medium, AspectRatio -> 1/2, PlotRange -> Full]] &, {"ageAtCreation", "appeared"}];

Pareto principle manifestation

Number of creations

Here is the Pareto principle plot for the number of created (or renamed) programming languages per creator (using the WFR function ParetoPrinciplePlot):

ResourceFunction["ParetoPrinciplePlot"][Association[Rule @@@ Tally[Normal@dsData[All, "creators"]]], ImageSize -> Large]

We can see that ≈25% of the creators correspond to ≈50% of the languages.
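
What such a plot shows can be sketched with made-up tallies: sort the counts in descending order and compute the cumulative shares. (The counts below are illustrative and not taken from the dataset.)

counts = {1, 5, 1, 3};
(* cumulative share of languages covered by the most prolific creators *)
Accumulate[ReverseSort[counts]]/Total[counts]

(*{1/2, 4/5, 9/10, 1}*)

In this toy example the first of four creators (25%) accounts for 50% of the languages.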

Popularity

Obviously, programmers can and do use more than one programming language. Nevertheless, it is interesting to see the Pareto principle plot for the languages “mind share” based on the number of users estimates.

ResourceFunction["ParetoPrinciplePlot"][Normal@dsData[Association, #id -> #numberOfUsersEstimate &], ImageSize -> Large]

Remark: Again, the plot above is “wrong” — programmers use more than one programming language.


Correlations

In order to see meaningful pairwise correlation plots, we take logarithms of the columns with large values:

dsDataVar = dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "numberOfJobsEstimate", "rank", "measurements", "pldbScore"}];
dsDataVar = dsDataVar[All, Append[#, <|"numberOfUsersEstimate" -> Log10[#numberOfUsersEstimate + 1], "numberOfJobsEstimate" -> Log10[#numberOfJobsEstimate + 1]|>] &];

Remark: Note that we “cheat” by adding 1 before taking the logarithms.
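
The reason for adding 1 is that the data contains zeros and Log10[0] is -Infinity, which cannot be placed in a plot. Here is a quick check on made-up values:

(* the shift by 1 maps zero to zero and barely changes large values *)
Log10[{0, 9, 99} + 1]

(*{0, 1, 2}*)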

We obtain the tables of correlation plots using the newly introduced, experimental PairwiseListPlot. If we remove the rows with zeroes, some of the correlations become more obvious. Here is the corresponding tab view of the two correlation tables:

TabView[{
"data" -> PairwiseListPlot[dsDataVar, PlotTheme -> "Business", ImageSize -> 800],
"zero-free data" -> PairwiseListPlot[dsDataVar[Select[FreeQ[Values[#], 0] &]], PlotTheme -> "Business", ImageSize -> 800]}]

Remark: Given the names of the data columns and the corresponding obvious interpretations we can say that the stronger correlations make sense.


Bubble chart 2D

In this section we make an informative 2D bubble chart with tooltips.

First, note that not all triplets of “appeared”, “ageAtCreation”, and “numberOfUsersEstimate” are unique:

ReverseSortBy[Tally[Normal[dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}]]], Last][[1 ;; 3]]

(*{{<|"appeared" -> 2017, "ageAtCreation" -> 33, "numberOfUsersEstimate" -> 420|>, 2}, {<|"appeared" -> 2023, "ageAtCreation" -> 39, "numberOfUsersEstimate" -> 11|>, 1}, {<|"appeared" -> 2022, "ageAtCreation" -> 55, "numberOfUsersEstimate" -> 6265|>, 1}}*)

Hence we make two datasets: (1) one for the core bubble chart, (2) the other for the labeling function:

aData = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}] &];
aData2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the labeling function (see the section “Applications” of the function page of BubbleChart):

Clear[LangLabeler];
LangLabeler[v_, {r_, c_}, ___] := Placed[Grid[{
{Style[aData2[[r, c]]["id"], Bold, 12], SpanFromLeft},
{"Creator(s):", aData2[[r, c]]["creators"]},
{"Appeared:", aData2[[r, c]]["appeared"]},
{"Age at creation:", aData2[[r, c]]["ageAtCreation"]},
{"Number of users:", aData2[[r, c]]["numberOfUsersEstimate"]}
}, Alignment -> Left], Tooltip];

Here is the bubble chart:

BubbleChart[
aData,
FrameLabel -> {"Appeared", "Age at Creation"},
PlotLabel -> "Number of users estimate",
BubbleSizes -> {0.05, 0.14},
LabelingFunction -> LangLabeler,
AspectRatio -> 1/2.5,
ChartStyle -> 7,
PlotTheme -> "Detailed",
ChartLegends -> {Keys[aData], None},
ImageSize -> 1000
]

Remark: The programming language J is a clear outlier because of creators’ ages.


Bubble chart 3D

In this section we make a 3D bubble chart.

As in the previous section we define two datasets: for the core plot and for the tooltips:

aData3D = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate"}] &];
aData3D2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the corresponding labeling function:

Clear[LangLabeler3D];
LangLabeler3D[v_, {r_, c_}, ___] := Placed[Grid[{
{Style[aData3D2[[r, c]]["id"], Bold, 12], SpanFromLeft},
{"Creator(s):", aData3D2[[r, c]]["creators"]},
{"Appeared:", aData3D2[[r, c]]["appeared"]},
{"Age at creation:", aData3D2[[r, c]]["ageAtCreation"]},
{"Number of users:", aData3D2[[r, c]]["numberOfUsersEstimate"]}
}, Alignment -> Left], Tooltip];

Here is the 3D chart:

BubbleChart3D[
aData3D,
AxesLabel -> {"appeared", "ageAtCreation", "measurements"},
PlotLabel -> "Number of users estimate",
BubbleSizes -> {0.02, 0.07},
LabelingFunction -> LangLabeler3D,
BoxRatios -> {1, 1, 1},
ChartStyle -> 7,
PlotTheme -> "Detailed",
ChartLegends -> {Keys[aData3D], None},
ImageSize -> 1000
]

Remark: In the 3D bubble chart plot “Mathematica” and “Wolfram Language” are easier to discern.


Second system effect traces

In this section we try — and fail — to demonstrate that the more programming languages a team of creators makes the less successful those languages are. (Maybe, because they are more cumbersome and suffer the Second system effect?)

Remark: This section is mostly made “for fun.” It is not true that each set of languages per creator team consists of comparable languages. For example, complementary languages can be in the same set. (See HTTP, HTML, URL.) Some sets consist of the same language under different names. (See Perl 6 and Raku, and Mathematica and Wolfram Language.) Also, older languages would have the first-mover advantage.

Make a creators-to-index association:

aCreators = KeySort@Association[Rule @@@ Select[Tally[Normal@dsData[All, "creators"]], #[[2]] > 1 &]];
aNameToIndex = AssociationThread[Keys[aCreators], Range[Length[aCreators]]];

Make a bubble chart with relative popularity per creators team:

aNUsers = Normal@GroupBy[dsData, #creators &, (m = Max[1, Max[Sqrt@KeyTake[#, "numberOfUsersEstimate"]]]; Map[Tooltip[{#appeared, #creators /. aNameToIndex, Sqrt[#numberOfUsersEstimate]/m}, Grid[{{Style[#id, Black, Bold], SpanFromLeft}, {"Creator(s):", #creators}, {"Users:", #numberOfUsersEstimate}}, Alignment -> Left]] &, #]) &];
aNUsers = KeySort@Select[aNUsers, Length[#] > 1 &];
BubbleChart[aNUsers, AspectRatio -> 2, BubbleSizes -> {0.02, 0.05}, ChartLegends -> Keys[aNUsers], ImageSize -> Large, GridLines -> {None, Values[aNameToIndex]}, FrameTicks -> {{Reverse /@ (List @@@ Normal[aNameToIndex]), None}, {Automatic, Automatic}}]

From the plot above we cannot decisively say that:

The most recent creation of a team of programming language creators is not the team's most popular creation.

That statement, though, does hold for a fair amount of cases.


Instead of conclusions

Consider:

  • Making an interactive interface for the variables, types of plots, etc.
  • Placing callouts for the focus languages in bubble charts.

AI vision via Wolfram Language

Introduction

In the fall of 2023 OpenAI introduced the image vision model “gpt-4-vision-preview”, [OAIb1].

The model “gpt-4-vision-preview” represents a significant enhancement to the GPT-4 model, providing developers and AI enthusiasts with a more versatile tool capable of interpreting and narrating images alongside text. This development opens up new possibilities for creative and practical applications of AI in various fields.

For example, consider the following Wolfram Language (WL), developer-centric applications:

  • Narration of UML diagrams
  • Code generation from (suitably tweaked) narrations of architecture diagrams and charts
  • Generating presentation content drafts from slide images
  • Extracting information from technical plots
  • etc.

A more diverse set of the applications would be:

  • Dental X-ray images narration
  • Security or baby camera footage narration
    • How many people or cars are seen, etc.
  • Transportation trucks content descriptions
    • Wood logs, alligators, boxes, etc.
  • Web page visible elements descriptions
    • Top menu, biggest image seen, etc.
  • Creation of recommender systems for image collections
    • Based on both image features and image descriptions
  • etc.

As a first concrete example, consider the following image that fable-dramatizes the name “Wolfram” (https://i.imgur.com/UIIKK9w.jpg):

RemoveBackground@Import[URL["https://i.imgur.com/UIIKK9wl.jpg"]]

Here is its narration:

LLMVisionSynthesize["Describe very concisely the image", "https://i.imgur.com/UIIKK9w.jpg", "MaxTokens" -> 600]

You are looking at a stylized black and white illustration of a wolf and a ram running side by side among a forest setting, with a group of sheep in the background. The image has an oval shape.

Remark: In this notebook Mathematica and Wolfram Language (WL) are used as synonyms.

Remark: This notebook is the WL version of the notebook “AI vision via Raku”, [AA3].

Ways to use with WL

There are five ways to utilize image interpretation (or vision) services in WL:

  • Dedicated Web API functions, [MT1, CWp1]
  • LLM synthesizing, [AAp1, WRIp1]
  • LLM functions, [AAp1, WRIp1]
  • Dedicated notebook cell type, [AAp2, AAv1]
  • Any combination of the above

This document demonstrates the second, third, and fifth. The first one is demonstrated in the Wolfram Community post “Direct API access to new features of GPT-4 (including vision, DALL-E, and TTS)” by Marco Thiel, [MT1]. The fourth one is still “under design and consideration.”

Remark: The model “gpt-4-vision-preview” is given as a “chat completion model”; therefore, in this document we consider it to be a Large Language Model (LLM).

Packages and paclets

Here we load the WL package used below, [AAp1, AAp2, AAp3]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/LLMVision.m"]

Remark: The package LLMVision is “temporary” – It should be made into a Wolfram repository paclet, or (much better) its functionalities should be included in the “LLMFunctions” framework, [WRIp1].

Images

Here are the links to all images used in this document:

tblImgs = {{Row[{"Wolf and ram running together in forest"}], Row[{"https://i.imgur.com/UIIKK9w.jpg", ""}]}, {Row[{"LLM", " ", "functionalities", " ", "mind-map", ""}], Row[{"https://i.imgur.com/kcUcWnql.jpg", ""}]}, {Row[{"Single", " ", "sightseer", ""}], Row[{"https://i.imgur.com/LEGfCeql.jpg", ""}]}, {Row[{"Three", " ", "hunters", ""}], Row[{"https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg", ""}]}, {Row[{"Cyber", " ", "Week", " ", "Spending", " ", "Set", " ", "to", " ", "Hit", " ", "New", " ", "Highs", " ", "in", " ", "2023", ""}], Row[{"https://cdn.statcdn.com/Infographic/images/normal/7045.jpeg", ""}]}};
tblImgs = Map[Append[#[[1 ;; 1]], Hyperlink[#[[-1, 1, 1]]]] &, tblImgs];
TableForm[tblImgs, TableHeadings -> {None, {"Name", "Link"}}] /. {ButtonBox[n_, BaseStyle -> "Hyperlink", ButtonData -> { URL[u_], None}] :> Hyperlink[n, URL[u]]}
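Name                                               Link
Wolf and ram running together in forest            https://i.imgur.com/UIIKK9w.jpg
LLM functionalities mind-map                       https://i.imgur.com/kcUcWnql.jpg
Single sightseer                                   https://i.imgur.com/LEGfCeql.jpg
Three hunters                                      https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg
Cyber Week Spending Set to Hit New Highs in 2023   https://cdn.statcdn.com/Infographic/images/normal/7045.jpeg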

Document structure

Here is the structure of the rest of the document:

  • LLM synthesizing
    … using multiple image specs of different kinds.
  • LLM functions
    … workflows over technical plots.
  • Dedicated notebook cells
    … just excuses why they are not programmed yet.
  • Combinations (fairytale generation)
    … Multi-modal applications for replacing creative types.
  • Conclusions and leftover comments
    … frustrations untold.

LLM synthesizing

The simplest way to use OpenAI’s vision service is through the function LLMVisionSynthesize of the package “LLMVision”, [AAp1]. (Already demoed in the introduction.)

If the function LLMVisionSynthesize is given a list of images, a textual result corresponding to those images is returned. The argument “images” is a list of image URLs, image file names, or image Base64 representations. (Any combination of those element types can be specified.)
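
As a rough illustration of the Base64 kind of image spec, the sketch below encodes a small synthetic image and wraps the result in a data URL. The helper name toBase64ImageSpec and the "data:image/jpeg;base64," prefix are assumptions made here for illustration; the package “LLMVision” may build its specs differently.

```wolfram
(* Hypothetical helper, not part of LLMVision: encode an image as a Base64 data URL. *)
(* Whitespace is removed because the Base64 export may contain line breaks. *)
toBase64ImageSpec[img_Image] :=
  "data:image/jpeg;base64," <>
   StringDelete[ExportString[img, {"Base64", "JPEG"}], WhitespaceCharacter];

(* A small synthetic grayscale image, so that no files or network access are needed *)
imgTest = Image[Table[N[(i + j)/16], {i, 8}, {j, 8}]];

spec = toBase64ImageSpec[imgTest];
StringTake[spec, 30]
```

A string of this form can be mixed with URLs and file names in the “images” argument only if the function accepts it; that should be checked against the package code.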

Before demonstrating the vision functionality below we first obtain and show a couple of images.

Images

Here is a URL of an image: (https://i.imgur.com/LEGfCeql.jpg). Here is the image itself:

Import[URL["https://i.imgur.com/LEGfCeql.jpg"]]

OpenAI’s vision endpoint accepts POST specs that have image URLs or images converted into Base64 strings. When we use the LLMVisionSynthesize function and provide a file name under the “images” argument, the Base64 conversion is automatically applied to that file. Here is an example of how we apply Base64 conversion to the image from a given file path:

img1 = Import[$HomeDirectory <> "/Downloads/ThreeHunters.jpg"];
ColumnForm[{
   img1, 
   Spacer[10], 
   ExportString[img1, {"Base64", "JPEG"}] // Short}]

Image narration

Here is an image narration example with the two images above, again, one specified with a URL, the other with a file path:

LLMVisionSynthesize["Give concise descriptions of the images.", {"https://i.imgur.com/LEGfCeql.jpg", $HomeDirectory <> "/Downloads/ThreeHunters.jpg"}, "MaxTokens" -> 600]

1. The first image depicts a single raccoon perched on a tree branch, surrounded by a plethora of vibrant, colorful butterflies in various shades of blue, orange, and other colors, set against a lush, multicolored foliage background.

2. The second image shows three raccoons sitting together on a tree branch in a forest setting, with a warm, glowing light illuminating the scene from behind. The forest is teeming with butterflies, matching the one in the first image, creating a sense of continuity and shared environment between the two scenes.

Description of a mind-map

Here is an application that should be more appealing to WL-developers – getting a description of a technical diagram or flowchart. Well, in this case, it is a mind-map from [AA2]:

Import[URL["https://i.imgur.com/kcUcWnql.jpeg"]]

Here we get the vision model’s description of the mind-map above (and place the output in Markdown format):

mmDescr = LLMVisionSynthesize["How many branches this mind-map has? Describe each branch separately. Use relevant emoji prefixes.", "https://imgur.com/kcUcWnq.jpeg", "MaxTokens" -> 900]
This mind map has four primary branches, each diverging from a \
central node labeled "LLM functionalities." I will describe each one \
using relevant emoji prefixes:

1. 🖼️ **DALL-E** branch is in yellow and represents an access point to \
the DALL-E service, likely a reference to a Large Language Model \
(LLM) with image generation capabilities.

2. 🤖 **ChatGPT** branch in pink is associated with the ChatGPT \
service, suggesting it's a conversational LLM branch. There are two \
sub-branches:
   - **LLM prompts** indicates a focus on the prompts used to \
communicate with LLMs.
   - **Notebook-wide chats** suggests a feature or functionality for \
conducting chats across an entire notebook environment.

3. 💬 **LLM chat objects** branch in purple implies that there are \
objects specifically designed for chat interactions within LLM \
services.

4. ✍️ **LLM functions** branch in green seems to represent various \
functional aspects or capabilities of LLMs, with a sub-branch:
   - **Chatbooks** which may indicate a feature or tool related to \
managing or organizing chat conversations as books or records.

Converting descriptions to diagrams

Here from the obtained description we request a (new) Mermaid-JS diagram to be generated:

mmdChart = LLMSynthesize[{LLMPrompt["CodeWriter"], "Make the corresponding Mermaid-JS diagram code for the following description. Give the code only, without Markdown symbols.", mmDescr}]
graph TB
    center[LLM functionalities]
    center --> dalle[DALL-E]
    center --> chat[ChatGPT]
    center --> chatobj[LLM chat objects]
    center --> functions[LLM functions]
    chat --> prompts[LLM prompts]
    chat --> notebook[Notebook-wide chats]
    functions --> chatbooks[Chatbooks]

Here is a diagram made with the Mermaid-JS spec obtained above using the resource function “MermaidInk”, [AAf1]:

ResourceFunction["MermaidInk"][mmdChart]

Below is given an instance of one of the better LLM results for making a Mermaid-JS diagram over the “vision-derived” mind-map description.

ResourceFunction["MermaidInk"]["
graph TB
 A[LLM services access] --> B[DALL-E]
 A --> C[ChatGPT]
 A --> D[PaLM]
 A --> E[LLM chat objects]
 A --> F[Chatbooks]
 B -->|related to| G[DALL-E AI system]
 C -->|associated with| H[ChatGPT]
 D -->|related to| I[PaLM model]
 E -->|part of| J[chat-related objects/functionalities]
 F -->|implies| K[Feature or application related to chatbooks]
"]

Code generation from image descriptions

Here is an example of code generation based on the “vision derived” mind-map description above:

LLMSynthesize[{LLMPrompt["CodeWriter"], "Generate the Mathematica code of a graph that corresponds to the description:\n", mmDescr}]
Graph[{"LLM services access" -> "DALL-E","LLM services access" -> "ChatGPT",
"LLM services access" -> "PaLM",
"LLM services access" -> "LLM functionalities",
"LLM services access" -> "Chatbooks","LLM services access" -> "Notebook-wide chats",
"LLM services access" -> "Direct access of LLM services","LLM functionalities" -> "LLM prompts",
"LLM functionalities" -> "LLM functions","LLM functionalities" -> "LLM chat objects"},
VertexLabels -> "Name"]
ToExpression[%]

Analyzing graphical WL results

Consider another “serious” example – that of analyzing chess play positions. Here we get a chess position using the paclet “Chess”, [WRIp3]:

[Two images: the chess position (referred to as b2 below).]

Here we describe it with “AI vision”:

LLMVisionSynthesize["Describe the position.", Image[b2], "MaxTokens" -> 1000, "Temperature" -> 0.05]
This is a chess position from a game in progress. Here's the \
description of the position by algebraic notation for each piece:

White pieces:
- King (K) on c1
- Queen (Q) on e2
- Rooks (R) on h1 and a1
- Bishops (B) on e3 and f1
- Knights (N) on g4 and e2
- Pawns (P) on a2, b2, c4, d4, f2, g2, and h2

Black pieces:
- King (K) on e8
- Queen (Q) on e7
- Rooks (R) on h8 and a8
- Bishops (B) on f5 and g7
- Knights (N) on c6 and f6
- Pawns (P) on a7, b7, c7, d7, f7, g7, and h7

It's Black's turn to move. The position suggests an ongoing middle \
game with both sides having developed most of their pieces. White has \
castled queenside, while Black has not yet castled. The white knight \
on g4 is putting pressure on the black knight on f6 and the pawn on \
h7. The black bishop on f5 is active and could become a strong piece \
depending on the continuation of the game.

Remark: In our few experiments with this kind of image narration, a fair number of the individual pieces are described as being at the wrong chessboard locations.

Remark: In order to make the AI vision more successful, we increased the size of the chessboard frame tick labels, and turned the “a÷h” ticks uppercase (into “A÷H” ticks.) It is interesting to compare the vision results over chess positions with and without that transformation.

LLM Functions

Let us show more programmatic utilization of the vision capabilities.

Here is the workflow we consider:

  1. Ingest an image file and encode it into a Base64 string
  2. Make an LLM configuration with that image string (and a suitable model)
  3. Synthesize a response to a basic request (like, image description)
    • Using LLMSynthesize
  4. Make an LLM function for asking different questions over the image
    • Using LLMFunction
  5. Ask questions and verify results
    • ⚠️ Answers to “hard” numerical questions are often wrong.
    • It might be useful to get formatted outputs

Remark: The function LLMVisionSynthesize combines LLMSynthesize and step 2. The function LLMVisionFunction combines LLMFunction and step 2.

Image ingestion and encoding

Here we ingest an image and display it:

imgBarChart = Import[$HomeDirectory <> "/Downloads/Cyber-Week-Spending-Set-to-Hit-New-Highs-in-2023-small.jpeg"]

Remark: The image was downloaded from the post “Cyber Week Spending Set to Hit New Highs in 2023”.

Configuration and synthesis

Here we synthesize a response to an image description request:

LLMVisionSynthesize["Describe the image.", imgBarChart, "MaxTokens" -> 600]
The image shows a bar chart infographic titled "Cyber Week Spending \
Set to Hit New Highs in 2023" with a subtitle "Estimated online \
spending on Thanksgiving weekend in the United States." There are \
bars for five years (2019, 2020, 2021, 2022, and 2023) across three \
significant shopping days: Thanksgiving Day, Black Friday, and Cyber \
Monday.

The bars represent the spending amounts, with different colors for \
each year. The spending for 2019 is shown in navy blue, 2020 in a \
lighter blue, 2021 in yellow, 2022 in darker yellow, and 2023 in dark \
yellow, with a pattern that clearly indicates the 2023 data is a \
forecast.

From the graph, one can observe an increasing trend in estimated \
online spending, with the forecast for 2023 being the highest across \
all three days. The graph also has an icon that represents online \
shopping, consisting of a computer monitor with a shopping tag.

At the bottom of the infographic, there is a note that says the \
data's source is Adobe Analytics. The image also contains the \
Statista logo, which indicates that this graphic might have been \
created or distributed by Statista, a company that specializes in \
market and consumer data. Additionally, there are Creative Commons \
(CC) icons, signifying the sharing and use permissions of the graphic.

It's important to note that without specific numbers, I cannot \
provide actual figures, but the visual trend is clear -- \
there is substantial year-over-year growth in online spending during \
these key shopping dates, culminating in a forecasted peak for 2023.

Repeated questioning

Here we define an LLM function that allows multiple question request invocations over the image:

fst = LLMVisionFunction["For the given image answer the question: ``. Be as concise as possible in your answers.", imgBarChart, "MaxTokens" -> 300]
fst["How many years are presented in that image?"]
"Five years are presented in the image."
fst["Which year has the highest value? What is that value?"]
"2023 has the highest value, which is approximately $11B on Cyber Monday."

Remark: Numerical value readings over technical plots or charts seem to be often wrong. Often enough, OpenAI’s vision model warns about this in the responses.

Formatted output

Here we make a function with a specially formatted output that can be more easily integrated in (larger) workflows:

fjs = LLMVisionFunction["How many `1` per `2`? " <> LLMPrompt["NothingElse"]["JSON"], imgBarChart, "MaxTokens" -> 300, "Temperature" -> 0.1]

Here we invoke that function (in order to get the money per year “seen” by OpenAI’s vision):

res = fjs["money", "shopping day"]
```json
{
  "Thanksgiving Day": {
    "2019": "$4B",
    "2020": "$5B",
    "2021": "$6B",
    "2022": "$7B",
    "2023": "$8B"
  },
  "Black Friday": {
    "2019": "$7B",
    "2020": "$9B",
    "2021": "$9B",
    "2022": "$10B",
    "2023": "$11B"
  },
  "Cyber Monday": {
    "2019": "$9B",
    "2020": "$11B",
    "2021": "$11B",
    "2022": "$12B",
    "2023": "$13B"
  }
}
```

Remark: The above result should be structured as shopping-day:year:value. But occasionally it might be structured as year:shopping-day:value. In the latter case just re-run the LLM invocation.

Here we parse the obtained JSON into a WL association:

aMoney = ImportString[StringReplace[res, {"```json" -> "", "```" -> ""}], "RawJSON"]
<|"Thanksgiving Day" -> <|"2019" -> "$4B", "2020" -> "$5B", 
   "2021" -> "$6B", "2022" -> "$7B", "2023" -> "$8B"|>, 
 "Black Friday" -> <|"2019" -> "$7B", "2020" -> "$9B", 
   "2021" -> "$9B", "2022" -> "$10B", "2023" -> "$11B"|>, 
 "Cyber Monday" -> <|"2019" -> "$9B", "2020" -> "$11B", 
   "2021" -> "$11B", "2022" -> "$12B", "2023" -> "$13B"|>|>

Remark: Currently LLMVisionFunction does not have an interpreter (or “form”) parameter as LLMFunction does. This can be seen as one of the reasons to include LLMVisionFunction in the “LLMFunctions” framework.
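
Until such an interpreter is available, the parsed association can be post-processed by hand. Here is a minimal sketch that detects the year-first layout mentioned in the remark above and swaps the two levels, as an alternative to re-running the LLM invocation. The function names and the sample data are made up for illustration, and the "four digits means a year" test is an assumption.

```wolfram
(* Hypothetical helper: swap the two levels of a nested association. *)
(* Keys absent from an inner association produce Missing["KeyAbsent", ...] values. *)
transposeNested[a_Association] :=
  Association @ Map[
    Function[k2, k2 -> Association @ KeyValueMap[#1 -> #2[k2] &, a]],
    Union @@ (Keys /@ Values[a])];

(* Assumption: top-level keys that all look like four-digit years mean the year-first layout. *)
yearFirstQ[a_Association] :=
  AllTrue[Keys[a], StringMatchQ[#, Repeated[DigitCharacter, {4}]] &];

normalizeMoney[a_Association] := If[yearFirstQ[a], transposeNested[a], a];

(* Made-up sample in the year-first layout *)
aYearFirst = <|
   "2022" -> <|"Black Friday" -> "$10B", "Cyber Monday" -> "$12B"|>,
   "2023" -> <|"Black Friday" -> "$11B", "Cyber Monday" -> "$13B"|>|>;

normalizeMoney[aYearFirst]
```

An association that is already keyed by shopping day is returned unchanged, so normalizeMoney can be applied to every result.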

Here we convert the money strings into money quantities:

AbsoluteTiming[
  aMoney2 = Map[SemanticInterpretation, aMoney, {-1}] 
 ]
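
SemanticInterpretation is general but can be slow. When the strings follow a simple pattern like "$11B", a string-pattern parser is a possible alternative. The sketch below is not what the notebook uses; the function name parseMoney and the meaning of the suffixes (K, M, B for thousand, million, billion) are assumptions.

```wolfram
(* Hypothetical parser for strings such as "$11B". *)
(* ToExpression is applied only to the part matched by NumberString. *)
parseMoney[s_String] :=
  First[
   StringCases[s,
    "$" ~~ n : NumberString ~~ u : ("K" | "M" | "B") :>
     Quantity[
      ToExpression[n]*(u /. {"K" -> 10^3, "M" -> 10^6, "B" -> 10^9}),
      "USDollars"]],
   Missing["NotParsed", s]];

parseMoney /@ {"$4B", "$11B", "about eleven"}
```

Strings that do not match the pattern give Missing["NotParsed", ...] instead of a quantity. The parser can be mapped over the nested association in the same way as above, with Map[parseMoney, aMoney, {-1}].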

Here is the corresponding bar chart and the original bar chart (for comparison):

[Two images: the bar chart made from the extracted values and the original bar chart.]

Remark: The comparison shows “pretty good vision” by OpenAI! But, again, small (or maybe significant) discrepancies are observed.

Dedicated notebook cells

In the context of the “well-established” notebook solutions OpenAIMode, [AAp2], or Chatbook, [WRIp2], we can contemplate extensions to integrate OpenAI’s vision service.

The main challenges here include determining how users will specify images in the notebook, such as through URLs, file names, or Base64 strings, each with unique considerations. Additionally, we have to explore how best to enable users to input prompts or requests for image processing by the AI/LLM service.

This integration, while valuable, is not my immediate focus, as there are programmatic ways to access OpenAI’s vision service already. (See the previous sections.)

Combinations (fairy tale generation)

Consider the following computational workflow for making fairy tales:

  1. Draw or LLM-generate a few images that characterize parts of a story.
  2. Narrate the images using the LLM “vision” functionality.
  3. Use an LLM to generate a story over the narrations.

Remark: Multi-modal LLM / AI systems already combine steps 2 and 3.

Remark: The workflow above (after it is programmed) can be executed multiple times until satisfactory results are obtained.

Here are image generations using DALL-E for four different requests with the same illustrator name in them:

storyImages = 
   Map[
    ImageSynthesize["Painting in the style of John Bauer of " <> #] &,
    {"a girl gets a basket with wine and food for her grandma.", 
     "a big bear meets a girl carrying a basket in the forest.", 
     "a girl that gives food from a basket to a big bear.", 
     "a big bear builds a new house for girl's grandma."} 
   ];
storyImages // Length

(*4*)

Here we display the images:

storyImages

Here we get the image narrations (via OpenAI’s “vision service”):

storyImagesDescriptions = LLMVisionSynthesize["Concisely describe the images.", storyImages, "MaxTokens" -> 600]
1. A painting of a woman in a traditional outfit reaching into a
    basket filled with vegetables and bread beside a bottle.
2. An illustration of a person in a cloak holding a bucket and
    standing next to a large bear in a forest.
3. An artwork depicting a person sitting naked by a birch tree,
    sharing a cake with a small bear.
4. A picture of a person in a folk costume sitting next to a bear
    with a ladder leaning against a house.

Here we extract the descriptions into a list:

descr = StringSplit[storyImagesDescriptions, "\n"];
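
If the returned text wraps a single description over several lines, splitting at every newline breaks the items apart. Here is a minimal sketch of a more defensive split that cuts only before an item number. The function name splitNumbered and the sample text are made up for illustration.

```wolfram
(* Made-up sample that mimics a numbered list with wrapped items *)
txt = "1. A painting of a woman reaching into a\n    basket beside a bottle.\n2. An illustration of a person\n    standing next to a large bear.";

(* Split only at newlines that are followed by an item number such as "2. ", *)
(* then collapse each run of whitespace inside an item into a single space. *)
splitNumbered[s_String] :=
  StringReplace[#, Whitespace -> " "] & /@
   StringSplit[s, RegularExpression["\\n(?=\\d+\\.\\s)"]];

splitNumbered[txt]
```

For the sample text this gives a list of two strings, one per numbered item.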

Here we generate the story from the descriptions above (using OpenAI’s ChatGPT):

 LLMSynthesize[{"Write a story that fits the following four descriptions:", Sequence @@ descr}, LLMEvaluator -> LLMConfiguration["MaxTokens" -> 1200]]
In a small village nestled deep within a lush forest, lived a woman \
named Anya. She was gentle and kind-hearted, known for her artistic \
talent and love for nature. Anya had a keen eye for capturing the \
beauty of the world around her through her paintings. Each stroke of \
her brush seemed to hold a piece of her soul, and her art touched the \
hearts of all who laid their eyes upon it.

One sunny day, on the outskirts of the village, Anya set up her easel \
amidst a lively farmers' market. In front of her, she placed a large \
canvas, ready to bring her latest vision to life. With her palette \
filled with vibrant colors, she began painting a woman dressed in a \
traditional outfit, delicately reaching into a woven basket filled to \
the brim with fresh vegetables and warm bread. Beside the basket lay \
an empty bottle, hinting at a joyous feast anticipated for the day.

As Anya skillfully brought her painting to life, a cloak-wrapped \
figure caught her attention. Intrigued, she turned her easel slightly \
to capture the essence of this mysterious wanderer standing beside a \
mighty bear deep within the heart of the forest. In her illustration, \
she depicted the cloaked person, holding a bucket, their gaze met by \
the curious eyes of the regal woodland creature. The bond between \
them was palpable, a silent understanding as they stood together, \
guardians of the ancient woods.

Meanwhile, in a clearing not too far away, Anya discovered a scene \
that touched her deeply. She stumbled upon a person sitting naked \
beneath the shade of a majestic birch tree, a cake placed lovingly \
between them and a small bear. The artwork she created was a tender \
portrayal of the intimate connection shared by the two, a testament \
to the innate kindness that existed between species. Together, they \
enjoyed the sweet treat, their hearts entwined in empathy, neither \
fearing the vulnerability of their exposed selves.

Driven by her artistry, Anya's imagination continued to explore the \
fascinating relationship between humans and bears in her village. In \
her final artwork, she turned her focus to a person in a folk \
costume, sitting comfortably beside a towering bear. A ladder leaned \
against a charming wooden house in the background, illustrating the \
close bond shared between the village folks and their wild \
companions. Together, they stood tall, their spirits entwined in a \
balance of mutual respect and harmony.

As Anya showcased her artwork to the villagers, they were captivated \
by the depth of emotion expressed through her brushstrokes. Her \
paintings served as a reminder that love and understanding knew no \
boundaries, whether lived within the confines of villages or amidst \
the enchanting wilderness.

Anya became a celebrated artist, known far and wide for her ability \
to weave tales of compassion and unity through her exquisite \
paintings. Her work inspired generations to see the world through the \
lens of empathy, teaching them that even in unconventional \
connections between humans and animals, beauty could be found.

And so, her legacy lived on, her art continuing to touch the hearts \
of those who recognized the profound messages hidden within her \
strokes of color. For in every stroke, Anya immortalized the timeless \
bond between humanity and the natural world, forever reminding us of \
the kinship we share with the creatures that roam our earth.

Conclusions and leftover comments

  • The new OpenAI vision model, “gpt-4-vision-preview”, like all LLMs, produces too many words, and it has to be reined in and restricted.
  • The functions LLMVisionSynthesize and LLMVisionFunction have to be part of the “LLMFunctions” framework.
    • For example, “LLMVision*” functions do not have an interpreter (or “form”) argument.
  • The package “LLMVision” is meant to be simple and direct, not covering all angles.
  • It would be nice to have a dedicated notebook cell interface and workflow(s) for interacting with “AI vision” services designed and implemented.
    • The main challenge is the input of images.
  • Generating code from hand-written diagrams might be a really effective demo using WL.
  • It would be interesting to apply the “AI vision” functionalities over displays from, say, chess or playing-card paclets.

References

Articles

[AA1] Anton Antonov, “Workflows with LLM functions (in WL)”, August 4, (2023), Wolfram Community, STAFF PICKS.

[AA2] Anton Antonov, “Raku, Python, and Wolfram Language over LLM functionalities”, (2023), Wolfram Community.

[AA3] Anton Antonov, “AI vision via Raku”, (2023), Wolfram Community.

[MT1] Marco Thiel, “Direct API access to new features of GPT-4 (including vision, DALL-E, and TTS)”, November 8, (2023), Wolfram Community, STAFF PICKS.

[OAIb1] OpenAI team, “New models and developer products announced at DevDay”, (2023), OpenAI/blog.

Functions, packages, and paclets

[AAf1] Anton Antonov, MermaidInk, WL function, (2023), Wolfram Function Repository.

[AAp1] Anton Antonov, LLMVision.m, Mathematica package, (2023), GitHub/antononcube.

[AAp2] Anton Antonov, OpenAIMode, WL paclet, (2023), Wolfram Language Paclet Repository.

[AAp3] Anton Antonov, OpenAIRequest.m, Mathematica package, (2023), GitHub/antononcube.

[CWp1] Christopher Wolfram, OpenAILink, WL paclet, (2023), Wolfram Language Paclet Repository.

[WRIp1] Wolfram Research, Inc., LLMFunctions, WL paclet, (2023), Wolfram Language Paclet Repository.

[WRIp2] Wolfram Research, Inc., Chatbook, WL paclet, (2023), Wolfram Language Paclet Repository.

[WRIp3] Wolfram Research, Inc., Chess, WL paclet, (2023), Wolfram Language Paclet Repository.

Videos

[AAv1] Anton Antonov, “OpenAIMode demo (Mathematica)”, (2023), YouTube/@AAA4Prediction.

Re-exploring the structure of Chinese character images

Introduction

In this notebook we show information retrieval and clustering
techniques over images of a Unicode collection of Chinese characters. Here
is the outline of the notebook’s exposition:

  1. Get Chinese character images.
  2. Cluster “image vectors” and demonstrate that the obtained
    clusters have certain explainability elements.
  3. Apply Latent Semantic Analysis (LSA) workflow to the character
    set.
  4. Show visual thesaurus through a recommender system. (That uses
    Cosine similarity.)
  5. Discuss graph and hierarchical clustering using LSA matrix
    factors.
  6. Demonstrate approximation of “unseen” character images with an
    image basis obtained through LSA over a small set of (simple)
    images.
  7. Redo character approximation with more “interpretable” image
    basis.

Remark: This notebook started as an (extended) comment for the
Community discussion “Exploring structure of Chinese characters through
image processing”, [SH1]. (Hence the title.)

Get Chinese character images

This code is a copy of the code in the original Community post by
Silvia Hao, [SH1]:

Module[{fsize = 50, width = 64, height = 64}, 
  lsCharIDs = Map[FromCharacterCode[#, "Unicode"] &, 16^^4E00 - 1 + Range[width height]]; 
 ]
charPage = Module[{fsize = 50, width = 64, height = 64}, 
    16^^4E00 - 1 + Range[width height] // pipe[
      FromCharacterCode[#, "Unicode"] & 
      , Characters, Partition[#, width] & 
      , Grid[#, Background -> Black, Spacings -> {0, 0}, ItemSize -> {1.5, 1.2}, Alignment -> {Center, Center}, Frame -> All, FrameStyle -> Directive[Red, AbsoluteThickness[3 \[Lambda]]]] & 
      , Style[#, White, fsize, FontFamily -> "Source Han Sans CN", FontWeight -> "ExtraLight"] & 
      , Rasterize[#, Background -> Black] & 
     ] 
   ];
chargrid = charPage // ColorDistance[#, Red] & // Image[#, "Byte"] & // Sign //Erosion[#, 5] &;
lmat = chargrid // MorphologicalComponents[#, Method -> "BoundingBox", CornerNeighbors -> False] &;
chars = ComponentMeasurements[{charPage // ColorConvert[#, "Grayscale"] &, lmat}, "MaskedImage", #Width > 10 &] // Values // Map@RemoveAlphaChannel;
chars = Module[{size = chars // Map@ImageDimensions // Max}, ImageCrop[#, {size, size}] & /@ chars];

Here is a sample of the obtained images:

SeedRandom[33];
RandomSample[chars, 5]

Vector representation of images

Define a function that represents an image as a vector in a linear
vector space (of pixels):

Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];
ImageToVector[___] := $Failed;
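
The flattening is row by row, so an image can be recovered from its vector by partitioning with the image width. Here is a small round-trip sketch on a made-up 2 x 3 image. The lowercase helper names are chosen here for illustration only: imageToVector repeats the first definition above so that the example is self-contained, and vectorToImage is a hypothetical inverse.

```wolfram
(* Same idea as ImageToVector above, repeated so that the example is self-contained *)
imageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];

(* Hypothetical inverse: rebuild an image from a flat vector, given the image width in pixels *)
vectorToImage[vec_List, width_Integer] := Image[Partition[vec, width]];

(* A test image with 2 rows and 3 columns; the pixel values are exactly representable reals *)
imgSmall = Image[{{0., 0.5, 1.}, {0.25, 0.75, 0.}}];

vec = imageToVector[imgSmall];
imgBack = vectorToImage[vec, First[ImageDimensions[imgSmall]]];

{Length[vec], ImageDimensions[imgBack], ImageData[imgBack] == ImageData[imgSmall]}
```

Note that ImageDimensions gives {width, height}, and the row length used for partitioning is the width. For the square character images used in this notebook the two are equal.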

Show what vector-represented images look like:

Table[BlockRandom[
   img = RandomChoice[chars]; 
   ListPlot[ImageToVector[img], Filling -> Axis, PlotRange -> All, PlotLabel -> img, ImageSize -> Medium, AspectRatio -> 1/6], 
   RandomSeeding -> rs], {rs, {33, 998}}]

Data preparation

In this section we represent the images into a linear vector space.
(In which each pixel is a basis vector.)

Make an association with images:

aCImages = AssociationThread[lsCharIDs -> chars];
Length[aCImages]

(*4096*)

Make flat vectors with the images:

AbsoluteTiming[
  aCImageVecs = ParallelMap[ImageToVector, aCImages]; 
 ]

(*{0.998162, Null}*)

Make matrix plots of a random sample of the image vectors:

SeedRandom[32];
MatrixPlot[Partition[#, ImageDimensions[aCImages[[1]]][[2]]]] & /@ RandomSample[aCImageVecs, 6]

Clustering over the image vectors

In this section we cluster “image vectors” and demonstrate that the
obtained clusters have certain explainability elements. Expected Chinese
character radicals are observed using image multiplication.

Cluster the image vectors and show a summary of the cluster lengths:

SparseArray[Values@aCImageVecs]
SeedRandom[334];
AbsoluteTiming[
  lsClusters = FindClusters[SparseArray[Values@aCImageVecs] -> Keys[aCImageVecs], 35, Method -> {"KMeans"}]; 
 ]
Length@lsClusters
ResourceFunction["RecordsSummary"][Length /@ lsClusters]

(*{24.6383, Null}*)

(*35*)

For each cluster:

  • Take 30 different small samples of 7 images
  • Multiply the images in each small sample
  • Show the three “most black” multiplication results

SeedRandom[33];
Table[i -> TakeLargestBy[Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters[[i]]], UpTo[7]], 30], Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}]

Remark: We can see that the clustering above
produced “semantic” clusters – most of the multiplied images show
meaningful Chinese character radicals and their “expected
positions.”
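
Here is a toy illustration of why image multiplication brings out shared strokes: the pixel-wise product of white-on-black images stays white only where all of the images are white. The two 3 x 3 “characters” below are made up for illustration.

```wolfram
(* Two tiny white-on-black "characters" that share only their left column *)
imgA = Image[{{1., 1., 0.}, {1., 0., 0.}, {1., 1., 1.}}];
imgB = Image[{{1., 0., 1.}, {1., 0., 1.}, {1., 0., 0.}}];

(* The pixel-wise product is white (1) only where both images are white *)
imgAB = ImageMultiply[imgA, imgB];
Round[ImageData[imgAB]]
```

Only the common left column survives in the product. With the real character images, the pixel total of a product measures how much of a shared stroke pattern is left, which is what the selection with TakeLargestBy above relies on.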

Here is one of the clusters with the radical “mouth”:

KeyTake[aCImages, lsClusters[[26]]]

LSAMon application

In this section we apply the “standard” LSA workflow, [AA1, AA4].

Make a matrix with named rows and columns from the image vectors:

mat = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]

The following Latent Semantic Analysis (LSA) monadic pipeline is used
in [AA2]:

SeedRandom[77];
AbsoluteTiming[
  lsaAllObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[mat]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 60, Method -> "SVD", "MaxSteps" -> 15, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]
(*{7.60828, Null}*)
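
For readers without the LSAMon package, the core of the topic extraction step can be approximated with the built-in truncated SVD. This is a sketch of the underlying idea on a small random matrix, not the LSAMon implementation; in particular, the term weighting and the matrix product normalization of the pipeline above are left out, and all names below are made up.

```wolfram
SeedRandom[1];
(* Stand-in for the document-term matrix: 6 "images" with 4 "pixels" each *)
m = RandomReal[1, {6, 4}];

(* Truncated SVD with k topics: m is approximated by w . h *)
k = 2;
{u, s, v} = SingularValueDecomposition[m, k];
w = u . s;           (* 6 x k matrix: coordinates of the images in the topic space *)
h = Transpose[v];    (* k x 4 matrix: each row is a topic, that is, a basis "image" *)

{Dimensions[w], Dimensions[h], Norm[m - w . h, "Frobenius"]}
```

The last element is the approximation error; it decreases as k grows and becomes numerically zero when k equals the number of columns. The rows of h play the role of the basis images echoed by the pipeline above.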

Remark: LSAMon’s corresponding theory and design are
discussed in [AA1, AA4].

Get the representation matrix:

W2 = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeW

Get the topics matrix:

H = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeH

Cluster the reduced dimension representations and show a summary of
the cluster lengths:

AbsoluteTiming[
  lsClusters = FindClusters[Normal[SparseArray[W2]] -> RowNames[W2], 40, Method -> {"KMeans"}]; 
 ]
Length@lsClusters
ResourceFunction["RecordsSummary"][Length /@ lsClusters]

(*{2.33331, Null}*)

(*40*)

Show cluster interpretations:

AbsoluteTiming[aAutoRadicals = Association@Table[i -> TakeLargestBy[Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters[[i]]], UpTo[8]], 30], Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}]; 
 ]
aAutoRadicals

(*{0.878406, Null}*)

Using FeatureExtraction

I experimented with clustering and approximation using WL’s function
FeatureExtraction. The results are fairly similar to the above; the
timings are different (a few times slower).

Visual thesaurus

In this section we use Cosine similarity to find visual nearest
neighbors of Chinese character images.

matPixels = WeightTermsOfSSparseMatrix[lsaAllObj\[DoubleLongRightArrow]LSAMonTakeWeightedDocumentTermMatrix, "IDF", "None", "Cosine"];
matTopics = WeightTermsOfSSparseMatrix[lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Left]\[DoubleLongRightArrow]LSAMonTakeW, "None", "None", "Cosine"];
smrObj = SMRMonUnit[]\[DoubleLongRightArrow]SMRMonCreate[<|"Topic" -> matTopics, "Pixel" -> matPixels|>];

Consider the character “團”:

aCImages["團"]

Here are the nearest neighbors for that character found by using both
image topics and image pixels:

(*focusItem=RandomChoice[Keys@aCImages];*)
  focusItem = {"團", "仼", "呔"}[[1]]; 
   smrObj\[DoubleLongRightArrow]
     SMRMonEcho[Style["Nearest neighbors by pixel topics:", Bold, Purple]]\[DoubleLongRightArrow]
     SMRMonSetTagTypeWeights[<|"Topic" -> 1, "Pixel" -> 0|>]\[DoubleLongRightArrow]
     SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]\[DoubleLongRightArrow]
     SMRMonEchoValue\[DoubleLongRightArrow]
     SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &]\[DoubleLongRightArrow]
     SMRMonEcho[Style["Nearest neighbors by pixels:", Bold, Purple]]\[DoubleLongRightArrow]
     SMRMonSetTagTypeWeights[<|"Topic" -> 0, "Pixel" -> 1|>]\[DoubleLongRightArrow]
     SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]\[DoubleLongRightArrow]
     SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &];

Remark: Of course, in the recommender pipeline above
we can use both pixels and pixel topics. (With their contributions
being weighted.)
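
The Cosine similarity lookup behind this section can be sketched with built-in functions only. The vectors and the function name cosineNeighbors below are made up for illustration; the SMRMon recommender works over sparse matrices and tag-type weights, which this sketch does not reproduce:

```wolfram
(* Made-up "pixel" vectors keyed by item name. *)
aVecsToy = <|"a" -> {1, 0, 1, 0}, "b" -> {1, 0, 0.9, 0.1}, 
   "c" -> {0, 1, 0, 1}, "d" -> {0.2, 1, 0, 0.8}|>;

(* Normalize every vector, so that dot products are cosine similarities. *)
aUnitToy = Map[Normalize[N[#]] &, aVecsToy];

(* Rank the other items by cosine similarity to the focus item. *)
cosineNeighbors[focus_String, n_Integer] := 
  TakeLargest[Map[Dot[#, aUnitToy[focus]] &, KeyDrop[aUnitToy, focus]], n];

(* The two nearest neighbors of "a": first "b", then "d". *)
cosineNeighbors["a", 2]
```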

Graph clustering

In this section we demonstrate the use of graph communities to find
similar groups of Chinese characters.

Here we take the reduced dimension matrix computed
above:

W = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeW;

Here we find the similarity matrix between the characters and remove
entries corresponding to “small” similarities:

matSym = Clip[W . Transpose[W], {0.78, 1}, {0, 1}];
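
The three-argument form of Clip does the thresholding here: values below the lower bound are replaced with the first replacement value, values above the upper bound with the second. A small sketch with a made-up matrix (matSimToy is a hypothetical name):

```wolfram
(* Made-up 3x3 similarity matrix, not taken from the character data. *)
matSimToy = {{1., 0.9, 0.3}, {0.9, 1., 0.5}, {0.3, 0.5, 1.}};

(* Entries below 0.78 become 0; entries within [0.78, 1] stay as they are. *)
matClippedToy = Clip[matSimToy, {0.78, 1}, {0, 1}]
```

In the result only the pair of rows 1 and 2 keeps a non-zero off-diagonal similarity.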

Here we plot the obtained (clipped) similarity matrix:

MatrixPlot[matSym]

Here we:

  • Take array rules of the sparse similarity matrix
  • Drop the rules corresponding to the diagonal elements
  • Convert the keys of rules into undirected graph edges
  • Make the corresponding graph
  • Find graph’s connected components
  • Show the number of connected components
  • Show a tally of the number of nodes in the components

gr = Graph[UndirectedEdge @@@ DeleteCases[Union[Sort /@ Keys[SSparseMatrixAssociation[matSym]]], {x_, x_}]];
lsComps = ConnectedComponents[gr];
Length[lsComps]
ReverseSortBy[Tally[Length /@ lsComps], First]

(*138*)

(*{{1839, 1}, {31, 1}, {27, 1}, {16, 1}, {11, 2}, {9, 2}, {8, 1}, {7, 1}, {6, 5}, {5, 3}, {4, 8}, {3, 14}, {2, 98}}*)
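
The same steps can be traced on a tiny example. This is a sketch with built-in functions only: ArrayRules over a SparseArray stands in for the package function SSparseMatrixAssociation, and the matrix and the names namesToy, matToy2, and grToy are made up:

```wolfram
(* Made-up thresholded similarity matrix over five named items. *)
namesToy = {"a", "b", "c", "d", "e"};
matToy2 = SparseArray[{{1, 2} -> 0.9, {2, 1} -> 0.9, {2, 3} -> 0.8, 
    {3, 2} -> 0.8, {4, 5} -> 0.85, {5, 4} -> 0.85, {i_, i_} -> 1.}, {5, 5}];

(* Index pairs of the non-zero entries; Most drops the default-value rule. *)
pairsToy = Keys[Most[ArrayRules[matToy2]]];

(* Undirected edges without duplicates and without the diagonal. *)
edgesToy = UndirectedEdge @@@ 
   Map[namesToy[[#]] &, DeleteCases[Union[Sort /@ pairsToy], {x_, x_}]];

(* Graph and its connected components. *)
grToy = Graph[namesToy, edgesToy];
compsToy = ConnectedComponents[grToy];
Length[compsToy]
(* 2 *)
```

The two components are {"a", "b", "c"} and {"d", "e"}: "a" and "c" end up together only through their shared neighbor "b".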

Here we demonstrate that the clusters of Chinese characters make
sense:

aPrettyRules = Dispatch[Map[# -> Style[#, FontSize -> 36] &, Keys[aCImages]]];
CommunityGraphPlot[Subgraph[gr, TakeLargestBy[lsComps, Length, 10][[2]]], Method -> "SpringElectrical", VertexLabels -> Placed["Name", Above], AspectRatio -> 1, ImageSize -> 1000] /. aPrettyRules

Remark: By careful observation of the clusters and
graph connections we can convince ourselves that the similarities are
based on pictorial sub-elements (i.e. radicals) of the characters.

Hierarchical clustering

In this section we apply hierarchical clustering to the reduced
dimension representation of the Chinese character images.

Here we pick a cluster:

lsFocusIDs = lsClusters[[12]];
Magnify[ImageCollage[Values[KeyTake[aCImages, lsFocusIDs]]], 0.4]

Here is how we can make a dendrogram plot (not that useful here):

(*smat=W2\[LeftDoubleBracket]lsClusters\[LeftDoubleBracket]13\[RightDoubleBracket],All\[RightDoubleBracket];
Dendrogram[Thread[Normal[SparseArray[smat]]->Map[Style[#,FontSize->16]&,RowNames[smat]]],Top,DistanceFunction->EuclideanDistance]*)

Here is a heat-map plot with hierarchical clustering dendrogram (with
tool-tips):

gr = HeatmapPlot[W2[[lsFocusIDs, All]], DistanceFunction -> {CosineDistance, None}, Dendrogram -> {True, False}];
gr /. Map[# -> Tooltip[Style[#, FontSize -> 16], Style[#, Bold, FontSize -> 36]] &, lsFocusIDs]

Remark: The plot above has tooltips with larger
character images.

Representing all characters with a smaller set of basic ones

In this section we demonstrate that a relatively small set of simpler
Chinese character images can be used to represent (or approximate) the
rest of the images.

Remark: We use the following heuristic: the simpler
Chinese characters have the smallest amount of white pixels.

Obtain a training set of images (the darkest ones) and show a
sample of that set:

{trainingInds, testingInds} = TakeDrop[Keys[SortBy[aCImages, Total[ImageToVector[#]] &]], 800];
SeedRandom[3];
RandomSample[KeyTake[aCImages, trainingInds], 12]
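
The split above amounts to sorting the keys by total pixel intensity and cutting the sorted list in two. A minimal sketch with made-up pixel vectors in place of images (all names are hypothetical):

```wolfram
(* Made-up binary "images" given directly as pixel vectors. *)
aPixToy = <|"x" -> {1, 1, 1, 0}, "y" -> {1, 0, 0, 0}, 
   "z" -> {1, 1, 0, 0}, "w" -> {1, 1, 1, 1}|>;

(* Sort by the number of white pixels, then take the 2 darkest as training keys. *)
{trainToy, testToy} = TakeDrop[Keys[SortBy[aPixToy, Total]], 2]
(* {{"y", "z"}, {"x", "w"}} *)
```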

Show all training characters with an image collage:

Magnify[ImageCollage[Values[KeyTake[aCImages, trainingInds]], Background -> Gray, ImagePadding -> 1], 0.4]

Apply LSA monadic pipeline with the training characters only:

SeedRandom[77];
AbsoluteTiming[
  lsaPartialObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[SparseArray[Values@KeyTake[aCImageVecs, trainingInds]]]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 80, Method -> "SVD", "MaxSteps" -> 120, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]
(*{0.826489, Null}*)

Get the matrix and basis interpretation of the extracted image
topics:

H = 
   lsaPartialObj\[DoubleLongRightArrow]
    LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
    LSAMonTakeH;
lsBasis = ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@ SparseArray[H];

Approximation of “unseen” characters

Pick a Chinese character image as a target image and pre-process
it:

ind = RandomChoice[testingInds];
imgTest = aCImages[ind];
matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aCImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
imgTest

Find its representation with the chosen feature extractor (LSAMon
object here):

matReprsentation = lsaPartialObj\[DoubleLongRightArrow]LSAMonRepresentByTopics[matImageTest]\[DoubleLongRightArrow]LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matReprsentation[[1, All]]];
ListPlot[MapIndexed[Tooltip[#1, lsBasis[[#2[[1]]]]] &, lsCoeff], Filling -> Axis, PlotRange -> All]

Show representation coefficients outliers:

lsBasis[[OutlierPosition[Abs[lsCoeff], TopOutliers@*HampelIdentifierParameters]]]
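
The idea of a Hampel-type top outlier rule can be sketched as follows. This is an assumption about the general technique (median plus a scaled median absolute deviation), not a copy of the package functions used above, which may use different constants; the names hampelTopToy and topOutlierPositionsToy are hypothetical:

```wolfram
(* Assumed upper threshold: median + 1.4826 * MAD. *)
hampelTopToy[data_?VectorQ] := 
  Median[data] + 1.4826*Median[Abs[data - Median[data]]];

(* Positions of the values above the threshold. *)
topOutlierPositionsToy[data_?VectorQ] := 
  With[{thr = hampelTopToy[data]}, 
   Pick[Range[Length[data]], Map[# > thr &, data]]];

(* Made-up absolute coefficients: two of them are clearly larger than the rest. *)
lsCoeffToy = {0.1, 0.12, 0.09, 0.11, 0.95, 0.1, 0.8};
topOutlierPositionsToy[lsCoeffToy]
(* {5, 7} *)
```

Applied to the representation coefficients, such positions pick the basis images that contribute most to the approximation.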

Show the interpretation of the found representation:

vecReprsentation = lsCoeff . SparseArray[H];
reprImg = Image[Unitize@Clip[#, {0.45, 1}, {0, 1}] &@Rescale[Partition[vecReprsentation, ImageDimensions[aCImages[[1]]][[1]]]]];
{reprImg, imgTest}
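
The interpretation step is a linear combination of the basis rows followed by rescaling and thresholding. Here is a minimal sketch with a made-up basis; it assumes orthonormal basis rows, so that the coefficients are plain dot products, which is not necessarily how LSAMonRepresentByTopics computes them:

```wolfram
(* Made-up basis: two orthonormal "topic" rows over four "pixels". *)
matHToy = {{1, 1, 0, 0}, {0, 0, 1, 1}}/Sqrt[2.];

(* Made-up target vector to approximate. *)
vecTargetToy = {1., 0.8, 0., 0.1};

(* Coefficients of the target in the basis (valid because the rows are orthonormal). *)
lsCoeffToy2 = matHToy . vecTargetToy;

(* Approximation as a linear combination of the basis rows. *)
vecApproxToy = lsCoeffToy2 . matHToy;

(* Rescale to [0, 1], zero out values below 0.45, and binarize. *)
Unitize[Clip[Rescale[vecApproxToy], {0.45, 1}, {0, 1}]]
(* {1, 1, 0, 0} *)
```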

See the closest characters using image distances:

KeyMap[# /. aCImages &, TakeSmallest[ImageDistance[reprImg, #] & /@ aCImages, 4]]

Remark: By applying the approximation procedure to
all characters in the testing set we can convince ourselves that a small
training set provides good retrieval. (Not done here.)

Finding more interpretable bases

In this section we show how to use the LSA workflow with Non-Negative
Matrix Factorization (NNMF) over an image set extended with already
extracted “topic” images.
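
To give an idea of what NNMF computes, here is a minimal sketch of the classic multiplicative update rules of Lee and Seung. That choice is an assumption made for illustration: the algorithm inside the LSAMon package may be different, and the names nnmfToy and matVToy are hypothetical:

```wolfram
(* Factor a non-negative matrix V (m x n) into W (m x k) and H (k x n), both non-negative. *)
nnmfToy[V_?MatrixQ, k_Integer, steps_Integer] := 
  Module[{W, H, eps = 10.^-9}, 
   W = RandomReal[{0.1, 1}, {Length[V], k}];
   H = RandomReal[{0.1, 1}, {k, Length[First[V]]}];
   Do[
    (* Element-wise updates; eps guards against division by zero. *)
    H = H*(Transpose[W] . V)/(Transpose[W] . W . H + eps);
    W = W*(V . Transpose[H])/(W . H . Transpose[H] + eps), 
    {steps}];
   {W, H}];

(* Made-up non-negative matrix. *)
SeedRandom[77];
matVToy = RandomReal[1, {6, 5}];
{wNNMFToy, hNNMFToy} = nnmfToy[matVToy, 2, 200];

(* Approximation error and a check that both factors are non-negative. *)
{Norm[matVToy - wNNMFToy . hNNMFToy, "Frobenius"], Min[wNNMFToy] >= 0 && Min[hNNMFToy] >= 0}
```

Because both factors are non-negative, the rows of the right factor can be read as additive "parts", which is why the resulting bases tend to be easier to interpret than SVD bases.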

Cleaner automatic radicals

aAutoRadicals2 = Map[Dilation[Binarize[DeleteSmallComponents[#]], 0.5] &, First /@ aAutoRadicals]

Here we take an image union in order to remove the “duplicated”
radicals:

aAutoRadicals3 = AssociationThread[Range[Length[#]], #] &@Union[Values[aAutoRadicals2], SameTest -> (ImageDistance[#1, #2] < 14.5 &)]
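
The de-duplication relies on Union with a custom SameTest. Numbers make the mechanism easier to see than images; the values and the threshold below are made up:

```wolfram
(* Elements closer than 0.5 are treated as duplicates of each other. *)
Union[{3.1, 1, 7, 1.2, 3}, SameTest -> (Abs[#1 - #2] < 0.5 &)]
(* {1, 3, 7} *)
```

With a distance-based test like this one, "sameness" is not transitive, so which representative is kept can depend on the order of the elements.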

LSAMon pipeline with NNMF

Make a matrix with named rows and columns from the image vectors:

mat1 = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]

Enhance the matrix with radicals instances:

mat2 = ToSSparseMatrix[SparseArray[Join @@ Map[Table[ImageToVector[#], 100] &, Values[aAutoRadicals3]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
mat3 = RowBind[mat1, mat2];

Apply the LSAMon workflow pipeline with NNMF for topic
extraction:

SeedRandom[77];
AbsoluteTiming[
  lsaAllExtendedObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[mat3]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 60, Method -> "NNMF", "MaxSteps" -> 15, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]
(*{155.289, Null}*)

Remark: Note that NNMF “found” the interpretable
radical images we enhanced the original image set with.

Get the matrix and basis interpretation of the extracted image
topics:

H = 
   lsaAllExtendedObj\[DoubleLongRightArrow]
    LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
    LSAMonTakeH;
lsBasis = ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@ SparseArray[H];

Approximation

Pick a Chinese character image as a target image and pre-process
it:

SeedRandom[43];
ind = RandomChoice[testingInds];
imgTest = aCImages[ind];
matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aCImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
imgTest

Find its representation with the chosen feature extractor (LSAMon
object here):

matReprsentation = lsaAllExtendedObj\[DoubleLongRightArrow]LSAMonRepresentByTopics[matImageTest]\[DoubleLongRightArrow]LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matReprsentation[[1, All]]];
ListPlot[MapIndexed[Tooltip[#1, lsBasis[[#2[[1]]]]] &, lsCoeff], Filling -> Axis, PlotRange -> All]

Show representation coefficients outliers:

lsBasis[[OutlierPosition[Abs[lsCoeff], TopOutliers@*QuartileIdentifierParameters]]]

Remark: Note that the expected
radical images are among the outliers.

Show the interpretation of the found representation:

vecReprsentation = lsCoeff . SparseArray[H];
reprImg = Image[Unitize@Clip[#, {0.45, 1}, {0, 1}] &@Rescale[Partition[vecReprsentation, ImageDimensions[aCImages[[1]]][[1]]]]];
{reprImg, imgTest}

See the closest characters using image distances:

KeyMap[# /. aCImages &, TakeSmallest[ImageDistance[reprImg, #] & /@ aCImages, 4]]

Setup

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicLatentSemanticAnalysis.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicSparseMatrixRecommender.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

References

[SH1] Silvia Hao, “Exploring structure of Chinese characters through image processing”, (2022), Wolfram Community.

[AA1] Anton Antonov, “A monad for Latent Semantic Analysis workflows”, (2019), Wolfram Community.

[AA2] Anton Antonov, “LSA methods comparison over random mandalas deconstruction – WL”, (2022), Wolfram Community.

[AA3] Anton Antonov, “Bethlehem stars: classifying randomly generated mandalas”, (2020), Wolfram Community.

[AA4] Anton Antonov, “Random mandalas deconstruction in R, Python, and Mathematica”, (2022), MathematicaForPrediction at WordPress.

[AAp1] Anton Antonov, LSAMon for Image Collections Mathematica package, (2022), MathematicaForPrediction at GitHub.

Random mandalas deconstruction in R, Python, and Mathematica

Today (2022-02-28) I gave a presentation at the Greater Boston useR Meetup titled “Random mandalas deconstruction with R, Python, and Mathematica”. (Link to the video recording.)


Here is the abstract:

In this presentation we discuss the application of different dimension reduction algorithms over collections of random mandalas. We discuss and compare the derived image bases and show how those bases explain the underlying collection structure. The presented techniques and insights (1) are applicable to any collection of images, and (2) can be included in larger, more complicated machine learning workflows. The former is demonstrated with a handwritten digits recognition application; the latter with the generation of random Bethlehem stars. The (parallel) walk-through of the core demonstration is in all three programming languages: Mathematica, Python, and R.


Here is the related RStudio project: “RandomMandalasDeconstruction”.

Here is a link to the R-computations notebook converted to HTML: “LSA methods comparison in R”.

The Mathematica notebooks are placed in project’s folder “notebooks-WL”.


See the work plan status in the org-mode file “Random-mandalas-deconstruction-presentation-work-plan.org”.

Here is the mind-map for the presentation:


The comparison workflow implemented in the notebooks of this project is summarized in the following flow chart:

Random mandalas deconstruction workflow


References

Articles

[AA1] Anton Antonov, “Comparison of dimension reduction algorithms over mandala images generation”, (2017), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Handwritten digits recognition by matrix factorization”, (2016), MathematicaForPrediction at WordPress.

Mathematica packages and repository functions

[AAp1] Anton Antonov, Monadic Latent Semantic Analysis Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAf1] Anton Antonov, NonNegativeMatrixFactorization, (2019), Wolfram Function Repository.

[AAf2] Anton Antonov, IndependentComponentAnalysis, (2019), Wolfram Function Repository.

[AAf3] Anton Antonov, RandomMandala, (2019), Wolfram Function Repository.

Python packages

[AAp2] Anton Antonov, LatentSemanticAnalyzer Python package (2021), PyPI.org.

[AAp3] Anton Antonov, Random Mandala Python package, (2021), PyPI.org.

R packages

[AAp4] Anton Antonov, Latent Semantic Analysis Monad R package, (2019), R-packages at GitHub/antononcube.

Crypto-currencies data acquisition with visualization

Introduction

In this notebook we show how to obtain crypto-currencies data from several data sources and make some basic time series visualizations. We assume the described data acquisition workflow is useful for doing more detailed (exploratory) analysis.

There are multiple crypto-currencies data sources, but only a small proportion of them provide a convenient way of extracting crypto-currencies data automatically. I found the easiest to work with to be https://finance.yahoo.com/cryptocurrencies, [YF1]. Another easy to work with, Bitcoin-only data source is https://data.bitcoinity.org, [DBO1].

(I also looked into using https://www.coindesk.com/coindesk20. )

Remark: The code below is made with certain ad-hoc inductive reasoning that brought meaningful results. This means the code has to be changed if the underlying data organization in [YF1, DBO1] is changed.

Yahoo! Finance

Getting cryptocurrencies symbols and summaries

In this section we get all crypto-currencies symbols and related metadata.

Get the data of all crypto-currencies in [YF1]:

AbsoluteTiming[
  lsData = Import["https://finance.yahoo.com/cryptocurrencies", "Data"]; 
 ]

(*{6.18067, Null}*)

Locate the data:

pos = First@Position[lsData, {"Symbol", "Name", "Price (Intraday)", "Change", "% Change", ___}];
dsCryptoCurrenciesColumnNames = lsData[[Sequence @@ pos]]
Length[dsCryptoCurrenciesColumnNames]

(*{"Symbol", "Name", "Price (Intraday)", "Change", "% Change", "Market Cap", "Volume in Currency (Since 0:00 UTC)", "Volume in Currency (24Hr)", "Total Volume All Currencies (24Hr)", "Circulating Supply", "52 Week Range", "1 Day Chart"}*)

(*12*)

Get the data:

dsCryptoCurrencies = lsData[[Sequence @@ Append[Most[pos], 2]]];
Dimensions[dsCryptoCurrencies]

(*{25, 10}*)
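
The locate-then-extract steps above can be tried on a small nested list. The structure below is made up and much simpler than the imported page data; the names lsPageToy and posToy are hypothetical:

```wolfram
(* Made-up nested import result: some header text, then a {column names, rows} pair. *)
lsPageToy = {"header", 
   {{"Symbol", "Name", "Price (Intraday)", "Change", "% Change", "Market Cap"}, 
    {{"BTC-USD", "Bitcoin", 1, 2, 3, 4}, {"ETH-USD", "Ethereum", 5, 6, 7, 8}}}};

(* Position of the column names row, found by pattern. *)
posToy = First@Position[lsPageToy, {"Symbol", "Name", "Price (Intraday)", "Change", "% Change", ___}]
(* {2, 1} *)

(* The data rows are the sibling element next to the column names. *)
lsPageToy[[Sequence @@ Append[Most[posToy], 2]]]
```

The pattern with ___ at the end keeps the lookup working if columns are added after "% Change".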

Make a dataset:

dsCryptoCurrencies = Dataset[dsCryptoCurrencies][All, AssociationThread[dsCryptoCurrenciesColumnNames[[1 ;; -3]], #] &]

Get all time series

In this section we get all the crypto-currencies time series from [YF1].

AbsoluteTiming[
  ccNow = Round@AbsoluteTime[Date[]] - AbsoluteTime[{1970, 1, 1, 0, 0, 0}]; 
  aCryptoCurrenciesDataRaw = 
   Association@
    Map[
     # -> ResourceFunction["ImportCSVToDataset"]["https://query1.finance.yahoo.com/v7/finance/download/" <> # <>"?period1=1410825600&period2=" <> ToString[ccNow] <> "&interval=1d&events=history&includeAdjustedClose=true"] &, Normal[dsCryptoCurrencies[All, "Symbol"]] 
    ]; 
 ]

(*{5.98745, Null}*)

Remark: Note that in the code above we specified the upper limit of the time span to be the current date. (And shifted it with respect to the epoch start 1970-01-01 used by [YF1].)
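
As a side note, the same epoch arithmetic can be done with the built-in function UnixTime. This is only a sketch of an alternative, not the code used to obtain the results in this notebook; the name ccNowAlt is hypothetical:

```wolfram
(* Seconds since 1970-01-01 00:00:00 UTC for the current moment. *)
ccNowAlt = UnixTime[];

(* The fixed lower limit period1 in the URL above corresponds to 2014-09-16 00:00:00 UTC. *)
UnixTime[DateObject[{2014, 9, 16, 0, 0, 0}, TimeZone -> 0]]
(* 1410825600 *)
```

Giving the time zone explicitly avoids shifts that come from the local time zone settings of the session.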

Check that we got the data by retrieving the dimensions:

Dimensions /@ aCryptoCurrenciesDataRaw

(*<|"BTC-USD" -> {2468, 7}, "ETH-USD" -> {2144, 7}, "USDT-USD" -> {2307, 7}, "BNB-USD" -> {1426, 7}, "ADA-USD" -> {1358, 7}, "DOGE-USD" -> {2468, 7}, "XRP-USD" -> {2468, 7}, "USDC-USD" -> {986, 7}, "DOT1-USD" -> {304, 7}, "HEX-USD" -> {551, 7}, "UNI3-USD" -> {81, 7},"BCH-USD" -> {1428, 7}, "LTC-USD" -> {2468, 7}, "SOL1-USD" -> {436, 7}, "LINK-USD" -> {1369, 7}, "THETA-USD" -> {1250, 7}, "MATIC-USD" -> {784, 7}, "XLM-USD" -> {2468, 7}, "ICP1-USD" -> {32, 7}, "VET-USD" -> {1052, 7}, "ETC-USD" -> {1792, 7}, "FIL-USD" -> {1285, 7}, "TRX-USD" -> {1376, 7}, "XMR-USD" -> {2468, 7}, "EOS-USD" -> {1450, 7}|>*)

Check that we got the data with a random sample:

RandomSample[#, 6] & /@ KeyTake[aCryptoCurrenciesDataRaw, RandomChoice[Keys@aCryptoCurrenciesDataRaw]]

Here we add the crypto-currencies symbols and convert date strings into date objects.

AbsoluteTiming[
  aCryptoCurrenciesData = Association@KeyValueMap[Function[{k, v}, k -> v[All, Join[<|"Symbol" -> k, "DateObject" -> DateObject[#Date]|>, #] &]], aCryptoCurrenciesDataRaw]; 
 ]

(*{8.27865, Null}*)

Summary

In this section we compute the summary over all datasets:

ResourceFunction["RecordsSummary"][Join @@ Values[aCryptoCurrenciesData], "MaxTallies" -> 30]

Plots

Here we plot the “Low” and “High” price time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
  Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]}, 
    DateListPlot[{
      Normal[dsTemp[All, {"DateObject", "Low"}][Values]], 
      Normal[dsTemp[All, {"DateObject", "High"}][Values]]}, 
     PlotLegends -> {"Low", "High"}, 
     AspectRatio -> 1/4, 
     PlotRange -> All] 
   ] &, 
  aCryptoCurrenciesData 
 ]

Here we plot the volume time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
  Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]}, 
    DateListPlot[{
      Normal[dsTemp[All, {"DateObject", "Volume"}][Values]]}, 
     PlotLabel -> "Volume", 
     AspectRatio -> 1/4, 
     PlotRange -> All] 
   ] &, 
  aCryptoCurrenciesData 
 ]

data.bitcoinity.org

In this section we ingest crypto-currency data from data.bitcoinity.org, [DBO1].

Metadata

In this sub-section we assign different metadata elements used in data.bitcoinity.org.

The currencies and exchanges we obtained by examining the output of:

Import["https://data.bitcoinity.org/markets/price/30d/USD?t=l", "Plaintext"]

Assignments

lsCurrencies = {"all", "AED", "ARS", "AUD", "BRL", "CAD", "CHF", "CLP", "CNY", "COP", "CZK", "DKK", "EUR", "GBP", "HKD", "HRK", "HUF", "IDR", "ILS", "INR", "IRR", "JPY", "KES", "KRW", "MXN", "MYR", "NOK", "NZD", "PHP", "PKR", "PLN", "RON", "RUB", "RUR", "SAR", "SEK", "SGD", "THB", "TRY", "UAH", "USD", "VEF", "XAU", "ZAR"};
lsExchanges = {"all", "bit-x", "bit2c", "bitbay", "bitcoin.co.id", "bitcoincentral", "bitcoinde", "bitcoinsnorway", "bitcurex", "bitfinex", "bitflyer", "bithumb", "bitmarketpl", "bitmex", "bitquick", "bitso", "bitstamp", "btcchina", "btce", "btcmarkets", "campbx", "cex.io", "clevercoin", "coinbase", "coinfloor", "exmo", "gemini", "hitbtc", "huobi", "itbit", "korbit", "kraken", "lakebtc", "localbitcoins", "mercadobitcoin", "okcoin", "paymium", "quadrigacx", "therocktrading", "vaultoro", "wallofcoins"};
lsTimeSpans = {"10m", "1h", "6h", "24h", "3d", "30d", "6m", "2y", "5y", "all"};
lsTimeUnit = {"second", "minute", "hour", "day", "week", "month"};
aDataTypeDescriptions = Association@{"price" -> "Price", "volume" -> "Trading Volume", "rank" -> "Rank", "bidask_sum" -> "Bid/Ask Sum", "spread" -> "Bid/Ask Spread", "tradespm" -> "Trades Per Minute"};
lsDataTypes = Keys[aDataTypeDescriptions];

Getting BTC data

Here we make a string template for CSV data retrieval from data.bitcoinity.org:

stDBOURL = StringTemplate["https://data.bitcoinity.org/export_data.csv?currency=`currency`&data_type=`dataType`&exchange=`exchange`&r=`timeUnit`&t=l&timespan=`timeSpan`"]

(*TemplateObject[{"https://data.bitcoinity.org/export_data.csv?currency=", TemplateSlot["currency"], "&data_type=", TemplateSlot["dataType"], "&exchange=", TemplateSlot["exchange"], "&r=", TemplateSlot["timeUnit"], "&t=l&timespan=", TemplateSlot["timeSpan"]}, CombinerFunction -> StringJoin, InsertionFunction -> TextString]*)

Here is an association with default values for the string template above:

aDBODefaultParameters = <|"currency" -> "USD", "dataType" -> "price", "exchange" -> "all", "timeUnit" -> "day", "timeSpan" -> "all"|>;

Remark: The metadata assigned above is used to form valid queries for the query string template.

Remark: Not all combinations of parameters are “fully respected” by data.bitcoinity.org. For example, if a data request is with time granularity that is too fine over a large time span, then the returned data is with coarser granularity.
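
The mechanism of default parameters plus overrides is plain association joining: for a repeated key, Join keeps the value from the last association. A minimal sketch with a made-up template and URL (stToy, aToyDefaults, and example.org are hypothetical, not part of data.bitcoinity.org):

```wolfram
(* Made-up template with two named slots. *)
stToy = StringTemplate["https://example.org/export.csv?currency=`currency`&r=`timeUnit`"];

(* Default values for all slots. *)
aToyDefaults = <|"currency" -> "USD", "timeUnit" -> "day"|>;

(* Override only the time unit; the currency keeps its default. *)
stToy[Join[aToyDefaults, <|"timeUnit" -> "hour"|>]]
(* "https://example.org/export.csv?currency=USD&r=hour" *)
```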

Price for a particular currency and exchange pair

Here we retrieve data by overwriting the parameters for currency, time unit, time span, and exchange:

dsBTCPriceData = 
  ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"currency" -> "EUR", "timeUnit" -> "hour", "timeSpan" -> "7d", "exchange" -> "coinbase"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCPriceData]

Volume data

Here we retrieve data by overwriting the parameters for data type, time unit, time span, and exchange:

dsBTCVolumeData = 
  ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"dataType" -> "volume", "timeUnit" -> "day", "timeSpan" -> "30d", "exchange" -> "all"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCVolumeData]

Plots

Price data

Here we extract the non-time columns in the tabular price data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCPriceData[All, {"Time", #}][Values]] & /@Rest[Normal[Keys[dsBTCPriceData[[1]]]]]], AspectRatio -> 1/4, ImageSize -> Large]

Volume data

Here we extract the non-time columns (corresponding to exchanges) in the tabular volume data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCVolumeData[All, {"Time", #}][Values]] & /@ Rest[Normal[Keys[dsBTCVolumeData[[1]]]]]], PlotRange -> All, AspectRatio -> 1/4, ImageSize -> Large]

References

[DBO1] https://data.bitcoinity.org.

[WK1] Wikipedia entry, Cryptocurrency.

[YF1] Yahoo! Finance, Cryptocurrencies.

Time series search engines over COVID-19 data

Introduction

In this article we announce the preparation and availability of interactive interfaces to two Time Series Search Engines (TSSEs) over COVID-19 data. One TSSE is based on Apple Mobility Trends data, [APPL1]; the other on The New York Times COVID-19 data, [NYT1].

Here are links to interactive interfaces of the TSSEs hosted (and publicly available) at shinyapps.io by RStudio:

Motivation: The primary motivation for making the TSSEs and their interactive interfaces is to use them as exploratory tools. Combined with relevant data analysis (e.g. [AA1, AA2]) the TSSEs should help to form better intuition and feel for the spread of COVID-19 and related data aggregation, public reactions, and government policies.

The rest of the article is structured as follows:

  1. Brief descriptions of the overall process and the data
  2. Brief descriptions of the search engines structure and implementation
  3. Discussions of a few search examples and their (possible) interpretations

The overall process

For both search engines the overall process has the same steps:

  1. Ingest the data
  2. Do basic (and advanced) data analysis
  3. Make (and publish) reports detailing the data ingestion and transformation steps
  4. Enhance the data with transformed versions of it or with additional related data
  5. Make a Time Series Sparse Matrix Recommender (TSSMR)
  6. Make a Time Series Search Engine Interactive Interface (TSSEII)
  7. Make the interactive interface easily accessible over the World Wide Web

Here is a flow chart that corresponds to the steps listed above:

TSSMRFlowChart

Data

The Apple data

The Apple Mobility Trends data is taken from Apple’s site, see [APPL1]. The data ingestion, basic data analysis, time series seasonality demonstration, and (graph) clusterings are given in [AA1]. (Here is a link to the corresponding R-notebook.)

The weather data was taken using the Mathematica function WeatherData, [WRI1].

(It was too much work to get the weather data using some of the well known weather data R packages.)

The New York Times data

The New York Times COVID-19 data is taken from GitHub, see [NYT1]. The data ingestion, basic data analysis, and visualizations are given in [AA2]. (Here is a link to the corresponding R-notebook.)

The search engines

The following sub-sections have screenshots of the TSSE interactive interfaces.

I did experiment with combining the data of the two engines, but it did not turn out to be particularly useful. It seems that it is more interesting and useful to enhance the Apple data engine with temperature data, and to enhance The New York Times engine with the (consecutive) differences of the time series.

Structure

The interactive interfaces have three panels:

  • Nearest Neighbors
    • Gives the time series nearest neighbors for the time series of selected entity.
    • Has interactive controls for entity selection and filtering.
  • Trend Finding
    • Gives the time series that adhere to a specified named trend.
    • Has interactive controls for trend curves selection and entity filtering.
  • Notes
    • Gives references and data objects summary.
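
The ideas behind the first two panels can be illustrated, very roughly, by ranking time series by correlation. The following Wolfram Language sketch uses made-up series and Pearson correlation as the similarity measure. It is an illustration of the idea only: the actual engines are built in R on top of Sparse Matrix Recommender objects, and the names aTSToy, tsNeighbors, and trendToy are hypothetical:

```wolfram
(* Made-up time series keyed by entity name. *)
aTSToy = <|
   "A" -> {1, 2, 3, 4, 5, 6}, 
   "B" -> {2, 4, 6, 8, 10, 12.5}, 
   "C" -> {6, 5, 4, 3, 2, 1}, 
   "D" -> {1, 3, 2, 4, 3, 5}|>;

(* "Nearest Neighbors": rank the other series by correlation with the focus series. *)
tsNeighbors[focus_String, n_Integer] := 
  TakeLargest[Map[Correlation[N[aTSToy[focus]], N[#]] &, KeyDrop[aTSToy, focus]], n];

(* The two nearest neighbors of "A": first "B", then "D". *)
tsNeighbors["A", 2]

(* "Trend Finding": rank all series against a trend curve, here a rising line. *)
trendToy = N[Range[6]];
TakeLargest[Map[Correlation[trendToy, N[#]] &, aTSToy], 2]
```

The decreasing series "C" gets correlation -1 with the rising trend, so it ends up last in both rankings.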

Implementation

Both TSSEs are implemented using the R packages “SparseMatrixRecommender”, [AAp1], and “SparseMatrixRecommenderInterfaces”, [AAp2].

The package “SparseMatrixRecommender” provides functions to create and use Sparse Matrix Recommender (SMR) objects. Both TSSEs use underlying SMR objects.

The package “SparseMatrixRecommenderInterfaces” provides functions to generate the server and client functions for the Shiny framework by RStudio.

As it was mentioned above, both TSSEs are published at shinyapps.io. The corresponding source codes can be found in [AAr1].

The Apple data TSSE has four types of time series (“entities”). The first three are normalized volumes of Apple maps requests while driving, transit transport use, and walking. (See [AA1] for more details.) The fourth is daily mean temperature at different geo-locations.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

AppleTSSENNs

AppleTSSETrends

The New York Times COVID-19 Data Search Engine

The New York Times TSSE has four types of time series: (aggregated) cases and deaths, and their corresponding time series differences.
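
The difference time series are obtained from the aggregated (cumulative) ones by taking consecutive differences. A tiny sketch in Wolfram Language with made-up numbers (the engine itself is implemented in R):

```wolfram
(* Made-up cumulative case counts for one county. *)
lsCumulativeToy = {0, 3, 7, 7, 15, 30};

(* Consecutive differences, i.e. the new cases per day. *)
Differences[lsCumulativeToy]
(* {3, 4, 0, 8, 15} *)
```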

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

NYTTSSENNs

NYTTSSETrends

Examples

In this section we discuss in some detail several examples of using each of the TSSEs.

Apple data search engine examples

Here are a few observations from [AA1]:

  • The COVID-19 lockdowns are clearly reflected in the time series.
  • The time series from the Apple Mobility Trends data show strong weekly seasonality. Roughly speaking, people go to places they are not familiar with on Fridays and Saturdays. On the other days of the work week people are more familiar with their trips. Since far fewer requests are made on Sundays, we can conjecture that many people stay at home or visit very familiar locations.

Here are a few assumptions:

  • Where people frequently go (work, school, groceries shopping, etc.) they do not need directions that much.
  • People request directions when they have more free time and will for “leisure trips.”
  • During vacations people are more likely to be in places they are less familiar with.
  • People are more likely to take leisure trips when the weather is good. (Warm, not raining, etc.)

Nice, France vs Florida, USA

Consider the results of the Nearest Neighbors panel for Nice, France.

Since the French tend to go on vacation in July and August ([SS1, INSEE1]) we can see that driving, transit, and walking in Nice have pronounced peaks during that time:

Of course, we also observe the lockdown period in that geographical area.

Compare those time series with the time series from driving in Florida, USA:

We can see that people in Florida, USA have driving patterns unrelated to the typical weather seasons and vacation periods.

(Further TSSE queries show that there is a negative correlation between the temperature in south Florida and the volumes of Apple Maps directions requests.)

Italy and Balkan countries driving

We can see that according to the data people who have access to both iPhones and cars in Italy and the Balkan countries Bulgaria, Greece, and Romania have similar directions requests patterns:

(The similarities can be explained with at least a few “obvious” facts, but we are going to restrain ourselves.)

The New York Times data search engine examples

In Broward county, Florida, USA and Cook county, Illinois, USA we can see two waves of infections in the difference time series:

References

Data

[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research (2008), WeatherData, Wolfram Language function.

Articles

[AA1] Anton Antonov, “Apple mobility trends data visualization (for COVID-19)”, (2020), SystemModeling at GitHub/antononcube.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub/antononcube.

[INSEE1] Institut national de la statistique et des études économiques, “En 2010, les salariés ont pris en moyenne six semaines de congé”, (2012).

[SS1] Sam Schechner and Lee Harris, “What Happens When All of France Takes Vacation? 438 Miles of Traffic”, (2019), The Wall Street Journal.

Packages, repositories

[AAp1] Anton Antonov, Sparse Matrix Recommender framework functions, (2019), R-packages at GitHub/antononcube.

[AAp2] Anton Antonov, Sparse Matrix Recommender framework interface functions, (2019), R-packages at GitHub/antononcube.

[AAr1] Anton Antonov, Coronavirus propagation dynamics, (2020), SystemModeling at GitHub/antononcube.

NY Times COVID-19 data visualization (Update)

Introduction

This post is both an update and a full-blown version of an older post — “NY Times COVID-19 data visualization” — using NY Times COVID-19 data up to 2021-01-13.

The purpose of this document/notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by The New York Times, [NYT1].

The following steps are taken:

  • Ingest data
    • Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
    • Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
  • Merge the data.
  • Make data summaries and related plots.
  • Make corresponding geo-plots.
  • Do “out of the box” time series forecast.
  • Analyze fluctuations around time series trends.

Note that other, older repositories with COVID-19 data exist; see, for example, [JH1, VK1].

Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.

Import data

NYTimes USA states data

dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"];
dsNYDataStates = dsNYDataStates[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsNYDataStates]

NYTimes USA counties data

dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"];
dsNYDataCounties = dsNYDataCounties[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsNYDataCounties]

US county records

dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];
dsUSACountyData = dsUSACountyData[All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsUSACountyData]

Merge data

Verify that the two datasets have common FIPS codes:

Length[Intersection[Normal[dsUSACountyData[All, "FIPS"]], Normal[dsNYDataCounties[All, "Fips"]]]]

(*3133*)

Merge the datasets:

dsNYDataCountiesExtended = Dataset[JoinAcross[Normal[dsNYDataCounties], Normal[dsUSACountyData[All, {"FIPS", "Lat", "Lon", "Population"}]], Key["Fips"] -> Key["FIPS"]]];
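
The merge above is an inner join on the FIPS code. Here is a minimal sketch of the same JoinAcross call shape on toy records; the record values and the names lsCases, lsGeo, lsJoined are made up for illustration and are not part of the datasets:

```wolfram
(* Toy records; all values are hypothetical *)
lsCases = {<|"Fips" -> 1001, "Cases" -> 10|>, <|"Fips" -> 1003, "Cases" -> 25|>, <|"Fips" -> 9999, "Cases" -> 3|>};
lsGeo = {<|"FIPS" -> 1001, "Population" -> 55000|>, <|"FIPS" -> 1003, "Population" -> 220000|>};

(* Join the records whose "Fips" value in the first list equals the "FIPS" value in the second list *)
lsJoined = JoinAcross[lsCases, lsGeo, Key["Fips"] -> Key["FIPS"]];

(* The record with code 9999 has no counterpart in lsGeo, so two records remain *)
Length[lsJoined]
```

Since the join is an inner one, counties missing from either dataset are silently dropped; comparing record counts before and after the merge is a quick way to see how many.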

Add a “DateObject” column and (reverse) sort by date:

dsNYDataCountiesExtended = dsNYDataCountiesExtended[All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];
dsNYDataCountiesExtended[[1 ;; 6]]

Basic data analysis

We consider cases and deaths for the last date only. (The queries can be easily adjusted for other dates.)

dfQuery = dsNYDataCountiesExtended[Select[#Date == dsNYDataCountiesExtended[1, "Date"] &], {"FIPS", "Cases", "Deaths"}];
dfQuery = dfQuery[All, Prepend[#, "FIPS" -> ToString[#FIPS]] &];
Total[dfQuery[All, {"Cases", "Deaths"}]]

(*<|"Cases" -> 22387340, "Deaths" -> 355736|>*)

Here is the summary of the values of cases and deaths across the different USA counties:

ResourceFunction["RecordsSummary"][dfQuery]

The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:

opts = {PlotRange -> All, ImageSize -> Medium};
Rasterize[Grid[
   Function[{columnName}, 
     {Histogram[Log10[#], PlotLabel -> Row[{Log10, Spacer[3], columnName}], opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[dfQuery[All, columnName]] 
    ] /@ {"Cases", "Deaths"}, 
   Dividers -> All, FrameStyle -> GrayLevel[0.7]]]

A couple of observations:

  • The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
  • Typical manifestation of the Pareto principle: 80% of the cases and deaths are registered in 20% of the counties.

Remark: The top 20% of the counties by cases are not necessarily the same as the top 20% of the counties by deaths.
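
The Pareto principle statement can also be checked numerically. The following is a minimal sketch with a hypothetical helper, ParetoTopShare, that is not part of the original notebook; it computes the share of the total held by the top fraction of the values:

```wolfram
(* Hypothetical helper: share of the total held by the largest frac of the values *)
Clear[ParetoTopShare];
ParetoTopShare[vals_?VectorQ, frac_ : 1/5] := 
  Total[TakeLargest[vals, Ceiling[frac*Length[vals]]]]/Total[vals];

(* Toy data: the two largest of ten values hold 150 out of 200 *)
ParetoTopShare[{100, 50, 5, 5, 5, 5, 5, 5, 10, 10}]

(* 3/4 *)
```

Assuming dfQuery is defined as above, ParetoTopShare[Normal[dfQuery[All, "Cases"]]] should give the corresponding share for the county cases.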

Distributions

Here we find the distributions that correspond to the cases and deaths (using FindDistribution):

ResourceFunction["GridTableForm"][List @@@ Map[Function[{columnName}, 
     columnName -> FindDistribution[N@Log10[Select[#, # > 0 &]]] &@Normal[dfQuery[All, columnName]] 
    ], {"Cases", "Deaths"}], TableHeadings -> {"Data", "Distribution"}]

Pareto principle locations

The following query finds the size of the intersection of the top 600 Pareto principle locations for the cases and for the deaths:

Length[Intersection @@ Map[Function[{columnName}, Keys[TakeLargest[Normal@dfQuery[Association, #FIPS -> #[columnName] &], 600]]], {"Cases", "Deaths"}]]

(*516*)
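
The same top-k intersection idea on toy data (the associations aToyCases and aToyDeaths are hypothetical and serve only as illustration):

```wolfram
(* Toy location -> value associations; all values are hypothetical *)
aToyCases = <|"A" -> 50, "B" -> 40, "C" -> 5, "D" -> 1|>;
aToyDeaths = <|"A" -> 3, "B" -> 0, "C" -> 2, "D" -> 1|>;

(* Locations that are in the top 2 by cases and in the top 2 by deaths *)
Intersection[Keys[TakeLargest[aToyCases, 2]], Keys[TakeLargest[aToyDeaths, 2]]]

(* {"A"} *)
```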

Geo-histogram

lsAllDates = Union[Normal[dsNYDataCountiesExtended[All, "Date"]]];
lsAllDates // Length

(*359*)

Manipulate[
  DynamicModule[{ds = dsNYDataCountiesExtended[Select[#Date == datePick &]]}, 
   GeoHistogram[
    Normal[ds[All, {"Lat", "Lon"}][All, Values]] -> N[Normal[ds[All, columnName]]], 
    Quantity[150, "Miles"], PlotLabel -> columnName, PlotLegends -> Automatic, ImageSize -> Large, GeoProjection -> "Equirectangular"] 
  ], 
  {{columnName, "Cases", "Data type:"}, {"Cases", "Deaths"}}, 
  {{datePick, Last[lsAllDates], "Date:"}, lsAllDates}]

Heat-map plots

An alternative to the geo-visualization is to use a heat-map plot. Here we use the package “HeatmapPlot.m”, [AAp1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Cases

Cross-tabulate states with dates over cases:

matSDC = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Cases"}], "Sparse" -> True];
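
To make clear what the cross-tabulation produces, here is a minimal sketch that builds the same kind of matrix with built-in functions only. It illustrates the idea and is not the implementation of CrossTabulate; the records and variable names are hypothetical:

```wolfram
(* Toy {state, date, cases} records; note the two records for NY on 2021-01-02 *)
lsRecs = {{"NY", "2021-01-01", 5}, {"NY", "2021-01-02", 7}, {"NY", "2021-01-02", 1}, {"FL", "2021-01-01", 2}};

(* Row names (states) and column names (dates), each mapped to an index *)
lsRows = Union[lsRecs[[All, 1]]];
lsCols = Union[lsRecs[[All, 2]]];
aRowInd = AssociationThread[lsRows, Range[Length[lsRows]]];
aColInd = AssociationThread[lsCols, Range[Length[lsCols]]];

(* Sum the values per (row, column) pair and place the sums in a sparse matrix *)
matToy = SparseArray[
   Normal[GroupBy[lsRecs, {aRowInd[#[[1]]], aColInd[#[[2]]]} &, Total[#[[All, 3]]] &]], 
   {Length[lsRows], Length[lsCols]}];
Normal[matToy]

(* {{2, 0}, {5, 8}} *)
```

The rows correspond to the states ("FL", "NY") and the columns to the dates.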

Make a heat-map plot by sorting the rows of the cross-tabulation matrix (that correspond to states):

HeatmapPlot[matSDC, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Deaths

Cross-tabulate states with dates over deaths and plot:

matSDD = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Deaths"}], "Sparse" -> True];
HeatmapPlot[matSDD, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Time series analysis

Cases

Time series

For each date sum all cases over the counties, make a time series, and plot it:

tsCases = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Cases & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsCases, PlotLabel -> "Cases", opts]
ResourceFunction["RecordsSummary"][tsCases["Path"]]

Logarithmic plot:

DateListPlot[Log10[tsCases], PlotLabel -> Row[{Log10, Spacer[3], "Cases"}], opts]

“Forecast”

Fit a time series model to log 10 of the time series:

tsm = TimeSeriesModelFit[Log10[tsCases]]

Plot log 10 data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Plot data and forecast:

DateListPlot[{tsCases, 10^TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Deaths

Time series

For each date sum all deaths over the counties, make a time series, and plot it:

tsDeaths = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Deaths & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsDeaths, PlotLabel -> "Deaths", opts]
ResourceFunction["RecordsSummary"][tsDeaths["Path"]]

“Forecast”

Fit a time series model:

tsm = TimeSeriesModelFit[tsDeaths, "ARMA"]

Plot data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Fluctuations

We want to see whether the time series data have fluctuations around their trends and to estimate the distributions of those fluctuations. (Knowing those distributions, some further studies can be done.)

This can be done efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

Fluctuations presence

Here we plot the consecutive differences of the cases:

DateListPlot[Differences[tsCases], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

Here we plot the consecutive differences of the deaths:

DateListPlot[Differences[tsDeaths], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

From the plots we see that the time series are not monotonically increasing and that there are non-trivial fluctuations in the data.
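
The connection between consecutive differences and monotonicity can be seen on a toy cumulative series (the values are made up): a negative difference means the cumulative count went down at that step, which for reported data usually indicates a correction.

```wolfram
(* Toy cumulative counts; the drop from 6 to 5 mimics a data correction *)
lsToyCumulative = {1, 3, 6, 5, 9};

(* Consecutive differences; a negative entry shows the series is not monotonically increasing *)
Differences[lsToyCumulative]

(* {2, 3, -1, 4} *)
```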

Absolute and relative errors distributions

Here we take an interesting part of the cases data:

tsData = TimeSeriesWindow[tsCases, {{2020, 5, 1}, {2020, 12, 31}}];

Here we specify a QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:

qrObj = 
   QRMonUnit[tsData]⟹
    QRMonEchoDataSummary⟹
    QRMonRescale[Axes -> {False, True}]⟹
    QRMonEchoDataSummary⟹
    QRMonQuantileRegression[16, 0.5]⟹
    QRMonSetRegressionFunctionsPlotOptions[{PlotStyle -> Red}]⟹
    QRMonDateListPlot[AspectRatio -> 1/4, ImageSize -> Large]⟹
    QRMonErrorPlots["RelativeErrors" -> False, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]⟹
    QRMonErrorPlots["RelativeErrors" -> True, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True];

Here we find the distribution of the absolute errors (fluctuations) using FindDistribution:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*CauchyDistribution[6.0799*10^-6, 0.000331709]*)

Absolute errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*ExtremeValueDistribution[-0.000996315, 0.00207593]*)

Here we find the distribution of the relative errors:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*StudentTDistribution[0.0000511326, 0.00244023, 1.59364]*)

Relative errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*NormalDistribution[9.66949*10^-6, 0.00394395]*)
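
The two kinds of errors used above can be illustrated on toy values. This is a sketch under the assumption that the absolute error is the data value minus the fitted trend value and that the relative error is the absolute error divided by the data value; the exact conventions of QRMon should be checked in [AAp2]. All values and names below are hypothetical:

```wolfram
(* Toy data values and toy fitted trend values at the same points *)
lsToyData = {100., 110., 125., 150.};
lsToyTrend = {102., 108., 126., 147.};

(* Absolute errors (assumed convention: data minus trend) *)
lsAbsErrors = lsToyData - lsToyTrend

(* {-2., 2., -1., 3.} *)

(* Relative errors (assumed convention: absolute error divided by the data value) *)
lsRelErrors = lsAbsErrors/lsToyData
```

Relative errors are scale-free, which is why their distributions are easier to compare across periods with very different case counts.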

References

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.

[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.

[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.

Generation of Random Bethlehem Stars

Introduction

This document/notebook is inspired by the Mathematica Stack Exchange (MSE) question “Plotting the Star of Bethlehem”, [MSE1]. That MSE question requests efficient and fast plotting of a certain mathematical function that (maybe) looks like the Star of Bethlehem, [Wk1]. Instead of doing what the author of the question suggests, I decided to use a generative art program and workflows from three of the most important Machine Learning (ML) sub-cultures: Latent Semantic Analysis, Recommendations, and Classification.

Although we discuss the making of Bethlehem Star-like images, the ML workflows and corresponding code presented in this document/notebook have general applicability. In many situations we have to make classifiers based on data that has to be “feature engineered” through a pipeline of several types of ML transformative workflows, and that feature engineering requires multiple iterations of re-examination and tuning in order to achieve the set goals.

The document/notebook is structured as follows:

  1. Target Bethlehem Star images
  2. Simplistic approach
  3. Elaborated approach outline
  4. Sections that follow the elaborated approach outline:
    1. Data generation
    2. Feature extraction
    3. Recommender creation
    4. Classifier creation and utilization experiments

(This document/notebook is a “raw” chapter for the book “Simplified Machine Learning Workflows”, [AAr3].)

Target images

Here are the images taken from [MSE1] that we consider to be “Bethlehem Stars” in this document/notebook:

imgStar1 = Import["https://i.stack.imgur.com/qmmOw.png"];
imgStar2 = Import["https://i.stack.imgur.com/5gtsS.png"];
Row[{imgStar1, Spacer[5], imgStar2}]

We notice that similar images can be obtained using the Wolfram Function Repository (WFR) function RandomMandala, [AAr1]. Here are a dozen examples:

SeedRandom[5];
Multicolumn[Table[MandalaToWhiterImage@ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> RandomInteger[{2, 8}], "ConnectingFunction" -> FilledCurve@*BezierCurve], 12], 6, Background -> Black]

Simplistic approach

We can just generate a large enough set of mandalas and pick the ones we like.

More precisely we have the following steps:

  1. We generate, say, 200 random mandalas using BlockRandom and keeping track of the random seeds.
    1. The mandalas are generated with rotational symmetry order 2 and filled Bezier curve connections.
  2. We pick mandalas that look, more or less, like Bethlehem Stars.
  3. We add the picked mandalas to the results list.
  4. If too few mandalas are in the results list, we go to step 1.
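
The seed bookkeeping in step 1 can be sketched as follows. To keep the example self-contained, RandomReal stands in for the RandomMandala call; the seeds and the name aToyCandidates are hypothetical:

```wolfram
(* Generate one random object per seed and remember which seed produced it *)
lsToySeeds = {11, 22, 33};
aToyCandidates = Association@Table[rs -> BlockRandom[RandomReal[], RandomSeeding -> rs], {rs, lsToySeeds}];

(* A stored seed reproduces the same object, so only the seeds of the picked items need to be kept *)
BlockRandom[RandomReal[], RandomSeeding -> 22] == aToyCandidates[22]

(* True *)
```

This is why the list of reference seeds is enough to regenerate the picked mandalas.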

Here are some mandalas generated with those steps:

lsStarReferenceSeeds = DeleteDuplicates@{697734, 227488491, 296515155601, 328716690761, 25979673846, 48784395076, 61082107304, 63772596796, 128581744446, 194807926867, 254647184786, 271909611066, 296515155601, 575775702222, 595562118302, 663386458123, 664847685618, 680328164429, 859482663706};
Multicolumn[
  Table[BlockRandom[ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve, ColorFunction -> (White &), Background -> Black], RandomSeeding -> rs], {rs, lsStarReferenceSeeds}] /. GrayLevel[0.25`] -> White, 6, Appearance -> "Horizontal", Background -> Black]

Remark: The plot above looks prettier in a notebook converted with the resource function DarkMode.

Elaborated approach

Assume that we want to automate the simplistic approach described in the previous section.

One way to automate is to create a Machine Learning (ML) classifier that is capable of discerning which RandomMandala objects look like Bethlehem Star target images and which do not. With such a classifier we can write a function BethlehemMandala that applies the classifier on multiple results from RandomMandala and returns those mandalas that the classifier says are good.

Here are the steps of building the proposed classifier:

  • Generate a large enough Random Mandala Images Set (RMIS)
  • Create a feature extractor from a subset of RMIS
  • Assign features to all of RMIS
  • Make a recommender with the RMIS features and other image data (like pixel values)
  • Apply the RMIS recommender over the target Bethlehem Star images and determine and examine image sets that are:
    • the best recommendations
    • the worst recommendations
  • With the best and worst recommendations sets compose training data for classifier making
  • Train a classifier
  • Examine classifier application to (filtering of) random mandala images (both in RMIS and not in RMIS)
  • If the results are not satisfactory redo some or all of the steps above

Remark: If the results are not satisfactory we should consider using the obtained classifier at the data generation phase. (This is not done in this document/notebook.)

Remark: The elaborated approach outline and flow chart have general applicability, not just for generation of random images of a certain type.

Flow chart

Here is a flow chart that corresponds to the outline above:

(Flow chart image.)

A few observations for the flow chart follow:

  • The flow chart has a feature extraction block that shows that the feature extraction can be done in several ways.
    • The application of LSA is a type of feature extraction which this document/notebook uses.
  • If the results are not good enough the flow chart shows that the classifier can be used at the data generation phase.
  • If the results are not good enough there are several alternatives to redo or tune the ML algorithms.
    • Changing or tuning the recommender implies training a new classifier.
    • Changing or tuning the feature extraction implies making a new recommender and a new classifier.

Data generation and preparation

In this section we generate random mandala graphics, transform them into images and corresponding vectors. Those image-vectors can be used to apply dimension reduction algorithms. (Other feature extraction algorithms can be applied over the images.)

Generated data

Generate a large number of mandalas:

k = 20000;
knownSeedsQ = False;
SeedRandom[343];
lsRSeeds = Union@RandomInteger[{1, 10^9}, k];
AbsoluteTiming[
  aMandalas = 
    If[TrueQ@knownSeedsQ, 
     Association@Table[rs -> BlockRandom[ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve], RandomSeeding -> rs], {rs, lsRSeeds}], 
    (*ELSE*) 
     Association@Table[i -> ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve], {i, 1, k}] 
    ]; 
 ]

(*{18.7549, Null}*)

Check the number of mandalas generated:

Length[aMandalas]

(*20000*)

Show a sample of the generated mandalas:

Magnify[Multicolumn[MandalaToWhiterImage /@ RandomSample[Values@aMandalas, 40], 10, Background -> Black], 0.7]

Data preparation

Convert the mandala graphics into images using appropriately large (or appropriately small) image sizes:

AbsoluteTiming[
  aMImages = ParallelMap[ImageResize[#, {120, 120}] &, aMandalas]; 
 ]

(*{248.202, Null}*)

Flatten each of the images into vectors:

AbsoluteTiming[
  aMImageVecs = ParallelMap[Flatten[ImageData[Binarize@ColorNegate@ColorConvert[#, "Grayscale"]]] &, aMImages]; 
 ]

(*{16.0125, Null}*)

Remark: Below those vectors are called image-vectors.
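
A minimal sketch of the image-vector idea on a tiny image follows (the 3×3 image is made up for illustration): the pixel rows are flattened into one vector, and the vector can be partitioned back into an image by using the image width.

```wolfram
(* A tiny 3x3 grayscale image (hypothetical data) *)
imgToy = Image[{{0., 1., 0.}, {1., 1., 1.}, {0., 1., 0.}}];

(* Flatten the pixel rows into one image-vector *)
vecToy = Flatten[ImageData[imgToy]];
Length[vecToy]

(* 9 *)

(* Partition by the image width to recover the image *)
imgToyBack = Image[Partition[vecToy, First[ImageDimensions[imgToy]]]];
ImageData[imgToyBack] == ImageData[imgToy]

(* True *)
```

The same partitioning by image width is what makes it possible to display extracted topic vectors as images.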

Feature extraction

In this section we use the software monad LSAMon, [AA1, AAp1], to do dimension reduction over a subset of random mandala images.

Remark: Other feature extraction methods can be used through the built-in functions FeatureExtraction and FeatureExtract.

Dimension reduction

Create an LSAMon object and extract image topics using Singular Value Decomposition (SVD) or Independent Component Analysis (ICA), [AAr2]:

SeedRandom[893];
AbsoluteTiming[
  lsaObj = 
    LSAMonUnit[]⟹
     LSAMonSetDocumentTermMatrix[SparseArray[Values@RandomSample[aMImageVecs, UpTo[2000]]]]⟹
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]⟹
     LSAMonExtractTopics["NumberOfTopics" -> 40, Method -> "ICA", "MaxSteps" -> 240, "MinNumberOfDocumentsPerTerm" -> 0]⟹
     LSAMonNormalizeMatrixProduct[Normalized -> Left]; 
 ]

(*{16.1871, Null}*)

Show the importance coefficients of the topics (if SVD was used the plot would show the singular values):

ListPlot[Norm /@ SparseArray[lsaObj⟹LSAMonTakeH], Filling -> Axis, PlotRange -> All, PlotTheme -> "Scientific"]

Show the interpretation of the extracted image topics:

lsaObj⟹
   LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹
   LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aMImages[[1]]][[1]]]]] & /@ SparseArray[#H] &];

Approximation

Pick a test image that is a mandala image or a target image and pre-process it:

If[True, 
   ind = RandomChoice[Range[Length[Values[aMImages]]]]; 
   imgTest = MandalaToWhiterImage@aMandalas[[ind]]; 
   matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aMImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic], 
  (*ELSE*) 
   imgTest = Binarize[imgStar2, 0.5]; 
   matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aMImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic] 
  ];
imgTest

Find the representation of the test image with the chosen feature extractor (LSAMon object here):

matRepresentation = lsaObj⟹LSAMonRepresentByTopics[matImageTest]⟹LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matRepresentation[[1, All]]];
ListPlot[lsCoeff, Filling -> Axis, PlotRange -> All]

Show the interpretation of the found representation:

H = SparseArray[lsaObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeH];
vecRepresentation = lsCoeff . H;
ImageAdjust@Image[Rescale[Partition[vecRepresentation, ImageDimensions[aMImages[[1]]][[1]]]]]

Recommendations

In this section we utilize the software monad SMRMon, [AAp3], to create a recommender for the random mandala images.

Remark: Instead of the Sparse Matrix Recommender (SMR) object the built-in function Nearest can be used.
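
Here is a minimal sketch of the Nearest alternative on toy vectors. The vectors, their labels, and the choice of CosineDistance are assumptions made for illustration; this is not the workflow used below:

```wolfram
(* Toy labeled feature vectors (hypothetical data) *)
aToyVecs = <|"a" -> {1., 0., 0.}, "b" -> {0.7, 0.7, 0.}, "c" -> {0., 0., 1.}|>;

(* Nearest function that returns the labels of the closest vectors by cosine distance *)
nfToy = Nearest[Values[aToyVecs] -> Keys[aToyVecs], DistanceFunction -> CosineDistance];

(* The two vectors closest to the query, closest first *)
nfToy[{1., 0.05, 0.}, 2]

(* {"a", "b"} *)
```

Unlike the recommender object, a plain Nearest function gives no direct way to weigh different feature types (pixels versus topics) against each other; the feature vectors would have to be scaled before they are combined.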

Create an SSparseMatrix object for all image-vectors:

matImages = ToSSparseMatrix[SparseArray[Values@aMImageVecs], "RowNames" -> Automatic, "ColumnNames" -> Automatic]

Normalize the rows of the image-vectors matrix:

AbsoluteTiming[
  matPixel = WeightTermsOfSSparseMatrix[matImages, "None", "None", "Cosine"] 
 ]

Get the LSA topics matrix:

matH = (lsaObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeH)

Find the image topics representation for each image-vector (assuming matH was computed with SVD or ICA):

AbsoluteTiming[
  matTopic = matPixel . Transpose[matH] 
 ]

Here we create a recommender based on the images data (pixels) and extracted image topics (or other image features):

smrObj = 
   SMRMonUnit[]⟹
    SMRMonCreate[<|"Pixel" -> matPixel, "Topic" -> matTopic|>]⟹
    SMRMonApplyNormalizationFunction["Cosine"]⟹
    SMRMonSetTagTypeWeights[<|"Pixel" -> 0.2, "Topic" -> 1|>];

Remark: Note the weights assigned to the pixels and the topics in the recommender object above. Those weights were derived by examining the recommendation results shown below.

Here is the target image, i.e. the image for which we want to find the most similar mandala images:

imgTarget = Binarize[imgStar2, 0.5]

Here is the profile of the target image:

aProf = MakeSMRProfile[lsaObj, imgTarget, ImageDimensions[aMImages[[1]]]];
TakeLargest[aProf, 6]

(*<|"10032-10009-4392" -> 0.298371, "3906-10506-10495" -> 0.240086, "10027-10014-4387" -> 0.156797, "8342-8339-6062" -> 0.133822, "3182-3179-11222" -> 0.131565, "8470-8451-5829" -> 0.128844|>*)

Using the target image profile here we compute the recommendation scores for all mandala images of the recommender:

aRecs = 
   smrObj⟹
    SMRMonRecommendByProfile[aProf, All]⟹
    SMRMonTakeValue;

Here is a plot of the similarity scores:

Row[{ResourceFunction["RecordsSummary"][Values[aRecs]], ListPlot[Values[aRecs], ImageSize -> Medium, PlotRange -> All, PlotTheme -> "Detailed", PlotLabel -> "Similarity scores"]}]

Here are the closest (nearest neighbor) mandala images:

Multicolumn[Values[ImageAdjust@*ColorNegate /@ aMImages[[ToExpression /@ Take[Keys[aRecs], 48]]]], 12, Background -> Black]

Here are the most distant mandala images:

Multicolumn[Values[ImageAdjust@*ColorNegate /@ aMImages[[ToExpression /@ Take[Keys[aRecs], -48]]]], 12, Background -> Black]

Classifier creation and utilization

In this section we:

  • Prepare classifier data
  • Build and examine a classifier using the software monad ClCon, [AA2, AAp2], using appropriate training, testing, and validation data ratios
  • Build a classifier utilizing all training data
  • Generate Bethlehem Star mandalas by filtering mandala candidates with the classifier

As mentioned above, we prepare the data for building classifiers by:

  • Selecting top, highest scores recommendations and labeling them with True
  • Selecting bad, low score recommendations and labeling them with False

AbsoluteTiming[
  Block[{
    lsBest = Values@aMandalas[[ToExpression /@ Take[Keys[aRecs], 120]]], 
    lsWorse = Values@aMandalas[[ToExpression /@ Join[Take[Keys[aRecs], -200], RandomSample[Take[Keys[aRecs], {3000, -200}], 200]]]]}, 
   lsTrainingData = 
     Join[
      Map[MandalaToWhiterImage[#, ImageDimensions@aMImages[[1]]] -> True &, lsBest], 
      Map[MandalaToWhiterImage[#, ImageDimensions@aMImages[[1]]] -> False &, lsWorse] 
     ]; 
  ] 
 ]

(*{27.9127, Null}*)
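
The labeling scheme can be sketched on a toy association of recommendation scores (the identifiers and the scores are hypothetical): the highest scored items get the label True and the lowest scored items get the label False.

```wolfram
(* Toy recommendation scores (hypothetical) *)
aToyScores = <|"m3" -> 0.4, "m1" -> 0.9, "m5" -> 0.1, "m2" -> 0.7, "m4" -> 0.2|>;

(* Sort in descending order of the scores *)
aToySorted = ReverseSort[aToyScores];

(* Label the top 2 items with True and the bottom 2 items with False *)
Join[Thread[Take[Keys[aToySorted], 2] -> True], Thread[Take[Keys[aToySorted], -2] -> False]]

(* {"m1" -> True, "m2" -> True, "m4" -> False, "m5" -> False} *)
```

Items with middling scores are left out of this sketch; the code above additionally samples some of them into the False class.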

Using ClCon train a classifier and show its performance measures:

clObj = 
   ClConUnit[lsTrainingData]⟹
    ClConSplitData[0.75, 0.2]⟹
    ClConMakeClassifier["NearestNeighbors"]⟹
    ClConClassifierMeasurements⟹
    ClConEchoValue⟹
    ClConClassifierMeasurements["ConfusionMatrixPlot"]⟹
    ClConEchoValue;

Remark: We can re-run the ClCon workflow above several times until we obtain a classifier we want to use.
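
For readers without the ClCon package, a roughly similar split-train-measure workflow can be sketched with built-in functions only. This is an illustration on hypothetical toy data (2D points instead of images), not a replacement for the ClCon pipeline above:

```wolfram
SeedRandom[12];

(* Toy two-class data: points around (0, 0) labeled False and points around (4, 4) labeled True *)
lsToyData = Join[
   (# -> False) & /@ RandomVariate[NormalDistribution[0, 1], {60, 2}], 
   (# -> True) & /@ RandomVariate[NormalDistribution[4, 1], {60, 2}]];

(* Shuffle and split: 90 records for training, 30 for testing *)
{lsToyTrain, lsToyTest} = TakeDrop[RandomSample[lsToyData], 90];

(* Train a nearest neighbors classifier and measure it on the test part *)
cfToy = Classify[lsToyTrain, Method -> "NearestNeighbors"];
cmToy = ClassifierMeasurements[cfToy, lsToyTest];
cmToy["Accuracy"]
cmToy["ConfusionMatrixPlot"]
```

Re-running with a different random seed changes the split, which mirrors re-running the ClCon workflow until a satisfactory classifier is obtained.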

Train a classifier with all prepared data:

clObj2 = 
   ClConUnit[lsTrainingData]⟹
    ClConSplitData[1, 0.2]⟹
    ClConMakeClassifier["NearestNeighbors"];

Get the classifier function from ClCon object:

cfBStar = clObj2⟹ClConTakeClassifier

Here we generate Bethlehem Star mandalas using the classifier trained above:

SeedRandom[2020];
Multicolumn[MandalaToWhiterImage /@ BethlehemMandala[12, cfBStar, 0.87], 6, Background -> Black]

Generate Bethlehem Star mandala images utilizing the classifier (with a specified classifier probabilities threshold):

SeedRandom[32];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0.87, "Probabilities" -> True]]

Show unfiltered Bethlehem Star mandala candidates:

SeedRandom[32];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0, "Probabilities" -> True]]

Remark: Examine the probabilities in the image-probability associations above; they show that the classifier is “working.”

Here is another set of generated Bethlehem Star mandalas using rotational symmetry order 4:

SeedRandom[777];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0.8, "RotationalSymmetryOrder" -> 4, "Probabilities" -> True]]

Remark: Note that although a higher rotational symmetry order is used the highly scored results still seem relevant – they have the features of the target Bethlehem Star images.

References

[AA1] Anton Antonov, “A monad for Latent Semantic Analysis workflows”, (2019), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “A monad for classification workflows”, (2018), MathematicaForPrediction at WordPress.

[MSE1] “Plotting the Star of Bethlehem”, (2020), Mathematica Stack Exchange, question 236499.

[Wk1] Wikipedia entry, Star of Bethlehem.

Packages

[AAr1] Anton Antonov, RandomMandala, (2019), Wolfram Function Repository.

[AAr2] Anton Antonov, IndependentComponentAnalysis, (2019), Wolfram Function Repository.

[AAr3] Anton Antonov, “Simplified Machine Learning Workflows” book, (2019), GitHub/antononcube.

[AAp1] Anton Antonov, Monadic Latent Semantic Analysis Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Monadic contextual classification Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, Monadic Sparse Matrix Recommender Mathematica package, (2018), MathematicaForPrediction at GitHub/antononcube.

Code definitions

urlPart = "https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/";
Get[urlPart <> "MonadicLatentSemanticAnalysis.m"];
Get[urlPart <> "MonadicSparseMatrixRecommender.m"];
Get[urlPart <> "MonadicContextualClassification.m"];
Clear[MandalaToImage, MandalaToWhiterImage];
MandalaToImage[gr_Graphics, imgSize_ : {120, 120}] := ColorNegate@ImageResize[gr, imgSize];
MandalaToWhiterImage[gr_Graphics, imgSize_ : {120, 120}] := ColorNegate@ImageResize[gr /. GrayLevel[0.25`] -> Black, imgSize];
Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];
ImageToVector[___] := $Failed;
Clear[MakeSMRProfile];
MakeSMRProfile[lsaObj_LSAMon, gr_Graphics, imgSize_] := MakeSMRProfile[lsaObj, {gr}, imgSize];
MakeSMRProfile[lsaObj_LSAMon, lsGrs : {_Graphics ..}, imgSize_] := MakeSMRProfile[lsaObj, MandalaToWhiterImage[#, imgSize] & /@ lsGrs, imgSize];
MakeSMRProfile[lsaObj_LSAMon, img_Image, imgSize_] := MakeSMRProfile[lsaObj, {img}, imgSize];
MakeSMRProfile[lsaObj_LSAMon, lsImgs : {_Image ..}, imgSize_] := 
   Block[{lsImgVecs, matTest, aProfPixel, aProfTopic}, 
    lsImgVecs = ImageToVector[#, imgSize] & /@ lsImgs; 
    matTest = ToSSparseMatrix[SparseArray[lsImgVecs], "RowNames" -> Automatic, "ColumnNames" -> Automatic]; 
    aProfPixel = ColumnSumsAssociation[lsaObj⟹LSAMonRepresentByTerms[matTest]⟹LSAMonTakeValue]; 
    aProfTopic = ColumnSumsAssociation[lsaObj⟹LSAMonRepresentByTopics[matTest]⟹LSAMonTakeValue]; 
    aProfPixel = Select[aProfPixel, # > 0 &]; 
    aProfTopic = Select[aProfTopic, # > 0 &]; 
    Join[aProfPixel, aProfTopic] 
   ];
MakeSMRProfile[___] := $Failed;
Clear[BethlehemMandalaCandiate];
BethlehemMandalaCandiate[opts : OptionsPattern[]] := ResourceFunction["RandomMandala"][opts, "RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve];
Clear[BethlehemMandala];
Options[BethlehemMandala] = Join[{ImageSize -> {120, 120}, "Probabilities" -> False}, Options[ResourceFunction["RandomMandala"]]];
BethlehemMandala[n_Integer, cf_ClassifierFunction, opts : OptionsPattern[]] := BethlehemMandala[n, cf, 0.87, opts];
BethlehemMandala[n_Integer, cf_ClassifierFunction, threshold_?NumericQ, opts : OptionsPattern[]] := 
   Block[{imgSize, probsQ, res, resNew, aResScores = <||>, aResScoresNew = <||>}, 
     
     imgSize = OptionValue[BethlehemMandala, ImageSize]; 
     probsQ = TrueQ[OptionValue[BethlehemMandala, "Probabilities"]]; 
     
     res = {}; 
     While[Length[res] < n, 
      resNew = Table[BethlehemMandalaCandiate[FilterRules[{opts}, Options[ResourceFunction["RandomMandala"]]]], 2*(n - Length[res])]; 
      aResScoresNew = Association[# -> cf[MandalaToImage[#, imgSize], "Probabilities"][True] & /@ resNew]; 
      aResScoresNew = Select[aResScoresNew, # >= threshold &]; 
      aResScores = Join[aResScores, aResScoresNew]; 
      res = Keys[aResScores] 
     ]; 
     
     aResScores = TakeLargest[ReverseSort[aResScores], UpTo[n]]; 
     If[probsQ, aResScores, Keys[aResScores]] 
    ] /; n > 0;
BethlehemMandala[___] := $Failed