Doomsday clock parsing and plotting

Posted on December 31, 2024 by Anton Antonov Antonov

Introduction

The Doomsday Clock is a symbolic timepiece maintained by the Bulletin of the Atomic Scientists (BAS) since 1947. It represents how close humanity is perceived to be to global catastrophe, primarily nuclear war but also including climate change and biological threats. The clock’s hands are set annually to reflect the current state of global security; midnight signifies theoretical doomsday.

In this notebook we consider two tasks:

Parsing of Doomsday Clock reading statements
- Using both Functional Parsers (FP) (aka “parser combinators”), [AAp1], and Large Language Models (LLMs).
  - We take text data from the past announcements, and extract the Doomsday Clock reading statements.
Evolution of Doomsday Clock times
- We extract relevant Doomsday Clock timeline data from the corresponding Wikipedia page.
  - (Instead of using a page from BAS.)
- We show how timeline data from that Wikipedia page can be processed with “standard” Wolfram Language (WL) functions and with LLMs.
- The result plot shows the evolution of the minutes to midnight.
  - The plot could show trends, highlighting significant global events that influenced the clock setting.
  - Hence, we put in informative callouts and tooltips.

The data extraction and visualization in the notebook serve educational purposes or provide insights into historical trends of global threats as perceived by experts. We try to make the ingestion and processing code universal and robust, suitable for multiple evaluations now or in the (near) future.

Remark: Keep in mind that the Doomsday Clock is a metaphor and its settings are not just data points but reflections of complex global dynamics (by certain experts and a board of sponsors.)

Remark: Currently (2024-12-30) Doomsday Clock is set at 90 seconds before midnight.

Data ingestion

Here we ingest the Doomsday Clock timeline page and show corresponding statistics:

url = "https://thebulletin.org/doomsday-clock/timeline/";
txtEN = Import[url, "Plaintext"];
TextStats[txtEN]

(*<|"Characters" -> 77662, "Words" -> 11731, "Lines" -> 1119|>*)

By observing the (plain) text of that page we see the Doomsday Clock time setting can be extracted from the sentence(s) that begin with the following phrase:

startPhrase = "Bulletin of the Atomic Scientists";
sentence = Select[Map[StringTrim, StringSplit[txtEN, "\n"]], StringStartsQ[#, startPhrase] &] // First

(*"Bulletin of the Atomic Scientists, with a clock reading 90 seconds to midnight"*)

Grammar and parsers

Here is a grammar in Extended Backus-Naur Form (EBNF) for parsing Doomsday Clock statements:

ebnf = "
<TOP> = <clock-reading>  ;
<clock-reading> = <opening> , ( <minutes> | [ <minutes> , [ 'and' | ',' ] ] , <seconds> ) , 'to' , 'midnight' ;
<opening> = [ { <any> } ] , 'clock' , [ 'is' ] , 'reading' ; 
<any> = '_String' ;
<minutes> = <integer> <& ( 'minute' | 'minutes' )  <@ \"Minutes\"->#&;
<seconds> = <integer> <& ( 'second' | 'seconds' ) <@ \"Seconds\"->#&;
<integer> = '_?IntegerQ' ;";

Remark: The EBNF grammar above can be obtained with LLMs using a suitable prompt with example sentences. (We do not discuss that approach further in this notebook.)

Here the parsing functions are generated from the EBNF string above:

ClearAll["p*"]
res = GenerateParsersFromEBNF[ParseToEBNFTokens[ebnf]];
res // LeafCount

(*375*)

We must redefine the parser pANY (corresponding to the EBNF rule “”) in order to prevent pANY of gobbling the word “clock” and in that way making the parser pOPENING fail.

pANY = ParsePredicate[StringQ[#] && # != "clock" &];

Here are random sentences generated with the grammar:

SeedRandom[32];
GrammarRandomSentences[GrammarNormalize[ebnf], 6] // Sort // ColumnForm

54jfnd 9y2f clock is reading 46 second to midnight
clock is reading 900 minutes to midnight
clock is reading 955 second to midnight
clock reading 224 minute to midnight
clock reading 410 minute to midnight
jdsf5at clock reading 488 seconds to midnight

Verifications of the (sub-)parsers:

pSECONDS[{"90", "seconds"}]

(*{{{}, "Seconds" -> 90}}*)

pOPENING[ToTokens@"That doomsday clock is reading"]

(*{{{}, {{"That", "doomsday"}, {"clock", {"is", "reading"}}}}}*)

Here the “top” parser is applied:

str = "the doomsday clock is reading 90 seconds to midnight";
pTOP[ToTokens@str]

(*{{{}, {{{"the", "doomsday"}, {"clock", {"is", "reading"}}}, {{{}, "Seconds" -> 90}, {"to", "midnight"}}}}}*)

Here the sentence extracted above is parsed and interpreted into an association with keys “Minutes” and “Seconds”:

aDoomReading = Association@Cases[Flatten[pTOP[ToTokens@sentence]], _Rule]

(*<|"Seconds" -> 90|>*)

Plotting the clock

Using the interpretation derived above here we make a list suitable for ClockGauge:

clockShow = DatePlus[{0, 0, 0, 12, 0, 0}, {-(Lookup[aDoomReading, "Minutes", 0]*60 + aDoomReading["Seconds"]), "Seconds"}]

(*{-2, 11, 30, 11, 58, 30}*)

In that list, plotting of a Doomsday Clock image (or gauge) is trivial.

ClockGauge[clockShow, GaugeLabels -> Automatic]

Let us define a function that makes the clock-gauge plot for a given association.

Clear[DoomsdayClockGauge];
Options[DoomsdayClockGauge] = Options[ClockGauge];
DoomsdayClockGauge[m_Integer, s_Integer, opts : OptionsPattern[]] := DoomsdayClockGauge[<|"Minutes" -> m, "Seconds" -> s|>, opts];
DoomsdayClockGauge[a_Association, opts : OptionsPattern[]] :=
  Block[{clockShow},
   clockShow = DatePlus[{0, 0, 0, 12, 0, 0}, {-(Lookup[a, "Minutes", 0]*60 + Lookup[a, "Seconds", 0]), "Seconds"}];
   ClockGauge[clockShow, opts, GaugeLabels -> Placed[Style["Doomsday\nclock", RGBColor[0.7529411764705882, 0.7529411764705882, 0.7529411764705882], FontFamily -> "Krungthep"], Bottom]]
   ];

Here are examples:

Row[{
   DoomsdayClockGauge[17, 0], 
   DoomsdayClockGauge[1, 40, GaugeLabels -> Automatic, PlotTheme -> "Scientific"], 
   DoomsdayClockGauge[aDoomReading, PlotTheme -> "Marketing"] 
  }]

More robust parsing

More robust parsing of Doomsday Clock statements can be obtained in these three ways:

“Fuzzy” match of words
- For misspellings like “doomsdat” instead of “doomsday.”
Parsing of numeric word forms.
- For statements, like, “two minutes and twenty five seconds.”
Delegating the parsing to LLMs when grammar parsing fails.

Fuzzy matching

The parser ParseFuzzySymbol can be used to handle misspellings (via EditDistance):

pDD = ParseFuzzySymbol["doomsday", 2];
lsPhrases = {"doomsdat", "doomsday", "dumzday"};
ParsingTestTable[pDD, lsPhrases]

In order to include the misspelling handling into the grammar we manually rewrite the grammar. (The grammar is small, so, it is not that hard to do.)

pANY = ParsePredicate[StringQ[#] && EditDistance[#, "clock"] > 1 &];
pOPENING = ParseOption[ParseMany[pANY]]⊗ParseFuzzySymbol["clock", 1]⊗ParseOption[ParseSymbol["is"]]⊗ParseFuzzySymbol["reading", 2];
pMINUTES = "Minutes" -> # &⊙(pINTEGER ◁ ParseFuzzySymbol["minutes", 3]);
pSECONDS = "Seconds" -> # &⊙(pINTEGER ◁ ParseFuzzySymbol["seconds", 3]);
pCLOCKREADING = Cases[#, _Rule, Infinity] &⊙(pOPENING⊗(pMINUTES⊕ParseOption[pMINUTES⊗ParseOption[ParseSymbol["and"]⊕ParseSymbol["&"]⊕ParseSymbol[","]]]⊗pSECONDS)⊗ParseSymbol["to"]⊗ParseFuzzySymbol["midnight", 2]);

Here is a verification table with correct- and incorrect spellings:

lsPhrases = {
    "doomsday clock is reading 2 seconds to midnight", 
    "dooms day cloc is readding 2 minute and 22 sekonds to mildnight"};
ParsingTestTable[pCLOCKREADING, lsPhrases, "Layout" -> "Vertical"]

Parsing of numeric word forms

One way to make the parsing more robust is to implement the ability to parse integer names (or numeric word forms) not just integers.

Remark: For a fuller discussion — and code — of numeric word forms parsing see the tech note “Integer names parsing” of the paclet “FunctionalParsers”, [AAp1].

First, we make an association that connects integer names with corresponding integer values

aWordedValues = Association[IntegerName[#, "Words"] -> # & /@ Range[0, 100]];
aWordedValues = KeyMap[StringRiffle[StringSplit[#, RegularExpression["\\W"]], " "] &, aWordedValues];
Length[aWordedValues]

(*101*)

Here is how the rules look like:

aWordedValues[[1 ;; -1 ;; 20]]

(*<|"zero" -> 0, "twenty" -> 20, "forty" -> 40, "sixty" -> 60, "eighty" -> 80, "one hundred" -> 100|>*)

Here we program the integer names parser:

pUpTo10 = ParseChoice @@ Map[ParseSymbol[IntegerName[#, {"English", "Words"}]] &, Range[0, 9]];
p10s = ParseChoice @@ Map[ParseSymbol[IntegerName[#, {"English", "Words"}]] &, Range[10, 100, 10]];
pWordedInteger = ParseApply[aWordedValues[StringRiffle[Flatten@{#}, " "]] &, p10s\[CircleTimes]pUpTo10\[CirclePlus]p10s\[CirclePlus]pUpTo10];

Here is a verification table of that parser:

lsPhrases = {"three", "fifty seven", "thirti one"};
ParsingTestTable[pWordedInteger, lsPhrases]

There are two parsing results for “fifty seven”, because pWordedInteger is defined with p10s⊗pUpTo10⊕p10s… . This can be remedied by using ParseJust or ParseShortest:

lsPhrases = {"three", "fifty seven", "thirti one"};
ParsingTestTable[ParseJust@pWordedInteger, lsPhrases]

Let us change pINTEGER to parse both integers and integer names:

pINTEGER = (ToExpression\[CircleDot]ParsePredicate[StringMatchQ[#, NumberString] &])\[CirclePlus]pWordedInteger;
lsPhrases = {"12", "3", "three", "forty five"};
ParsingTestTable[pINTEGER, lsPhrases]

Let us try the new parser using integer names for the clock time:

str = "the doomsday clock is reading two minutes and forty five seconds to midnight";
pTOP[ToTokens@str]

(*{{{}, {"Minutes" -> 2, "Seconds" -> 45}}}*)

Enhance with LLM parsing

There are multiple ways to employ LLMs for extracting “clock readings” from arbitrary statements for Doomsday Clock readings, readouts, and measures. Here we use LLM few-shot training:

flop = LLMExampleFunction[{
    "the doomsday clock is reading two minutes and forty five seconds to midnight" -> "{\"Minutes\":2, \"Seconds\": 45}", 
    "the clock of the doomsday gives 92 seconds to midnight" -> "{\"Minutes\":0, \"Seconds\": 92}", 
    "The bulletin atomic scienist maybe is set to a minute an 3 seconds." -> "{\"Minutes\":1, \"Seconds\": 3}" 
   }, "JSON"]

Here is an example invocation:

flop["Maybe the doomsday watch is at 23:58:03"]

(*{"Minutes" -> 1, "Seconds" -> 57}*)

The following function combines the parsing with the grammar and the LLM example function — the latter is used for fallback parsing:

Clear[GetClockReading];
GetClockReading[st_String] := 
   Block[{op}, 
    op = ParseJust[pTOP][ToTokens[st]]; 
    Association@
     If[Length[op] > 0 && op[[1, 1]] === {}, 
      Cases[op, Rule], 
     (*ELSE*) 
      flop[st] 
     ] 
   ];

Robust parser demo

Here is the application of the combine function above over a certain “random” Doomsday Clock statement:

s = "You know, sort of, that dooms-day watch is 1 and half minute be... before the big boom. (Of doom...)";
GetClockReading[s]

(*<|"Minutes" -> 1, "Seconds" -> 30|>*)

Remark: The same type of robust grammar-and-LLM combination is explained in more detail in the video “Robust LLM pipelines (Mathematica, Python, Raku)”, [AAv1]. (See, also, the corresponding notebook [AAn1].)

Timeline

In this section we extract Doomsday Clock timeline data and make a corresponding plot.

Parsing page data

Instead of using the official Doomsday clock timeline page we use Wikipedia:

url = "https://en.wikipedia.org/wiki/Doomsday_Clock";
data = Import[url, "Data"];

Get timeline table:

tbl = Cases[data, {"Timeline of the Doomsday Clock [ 13 ] ", x__} :> x, Infinity] // First;

Show table’s columns:

First[tbl]

(*{"Year", "Minutes to midnight", "Time ( 24-h )", "Change (minutes)", "Reason", "Clock"}*)

Make a dataset:

dsTbl = Dataset[Rest[tbl]][All, AssociationThread[{"Year", "MinutesToMidnight", "Time", "Change", "Reason"}, #] &];
dsTbl = dsTbl[All, Append[#, "Date" -> DateObject[{#Year, 7, 1}]] &];
dsTbl[[1 ;; 4]]

Here is an association used to retrieve the descriptions from the date objects:

aDateToDescr = Normal@dsTbl[Association, #Date -> BreakStringIntoLines[#Reason] &];

Using LLM-extraction instead

Alternatively, we can extract the Doomsday Clock timeline using LLMs. Here we get the plaintext of the Wikipedia page and show statistics:

txtWk = Import[url, "Plaintext"];
TextStats[txtWk]

(*<|"Characters" -> 43623, "Words" -> 6431, "Lines" -> 315|>*)

Here we get the Doomsday Clock timeline table from that page in JSON format using an LLM:

res = 
  LLMSynthesize[{
    "Give the time table of the doomsday clock as a time series that is a JSON array.", 
    "Each element of the array is a dictionary with keys 'Year', 'MinutesToMidnight', 'Time', 'Description'.", 
    txtWk, 
    LLMPrompt["NothingElse"]["JSON"] 
   }, 
   LLMEvaluator -> LLMConfiguration[<|"Provider" -> "OpenAI", "Model" -> "gpt-4o", "Temperature" -> 0.4, "MaxTokens" -> 5096|>] 
  ]

(*"```json[{\"Year\": 1947, \"MinutesToMidnight\": 7, \"Time\": \"23:53\", \"Description\": \"The initial setting of the Doomsday Clock.\"},{\"Year\": 1949, \"MinutesToMidnight\": 3, \"Time\": \"23:57\", \"Description\": \"The Soviet Union tests its first atomic bomb, officially starting the nuclear arms race.\"}, ... *)

Post process the LLM result:

res2 = ToString[res, CharacterEncoding -> "UTF-8"];
res3 = StringReplace[res2, {"```json", "```"} -> ""];
res4 = ImportString[res3, "JSON"];
res4[[1 ;; 3]]

(*{{"Year" -> 1947, "MinutesToMidnight" -> 7, "Time" -> "23:53", "Description" -> "The initial setting of the Doomsday Clock."}, {"Year" -> 1949, "MinutesToMidnight" -> 3, "Time" -> "23:57", "Description" -> "The Soviet Union tests its first atomic bomb, officially starting the nuclear arms race."}, {"Year" -> 1953, "MinutesToMidnight" -> 2, "Time" -> "23:58", "Description" -> "The United States and the Soviet Union test thermonuclear devices, marking the closest approach to midnight until 2020."}}*)

Make a dataset with the additional column “Date” (having date-objects):

dsDoomsdayTimes = Dataset[Association /@ res4];
dsDoomsdayTimes = dsDoomsdayTimes[All, Append[#, "Date" -> DateObject[{#Year, 7, 1}]] &];
dsDoomsdayTimes[[1 ;; 4]]

Here is an association that is used to retrieve the descriptions from the date objects:

aDateToDescr2 = Normal@dsDoomsdayTimes[Association, #Date -> #Description &];

Remark: The LLM derived descriptions above are shorter than the descriptions in the column “Reason” of the dataset obtained parsing the page data. For the plot tooltips below we use the latter.

Timeline plot

In order to have informative Doomsday Clock evolution plot we obtain and partition dataset’s time series into step-function pairs:

ts0 = Normal@dsDoomsdayTimes[All, {#Date, #MinutesToMidnight} &];
ts2 = Append[Flatten[MapThread[Thread[{#1, #2}] &, {Partition[ts0[[All, 1]], 2, 1], Most@ts0[[All, 2]]}], 1], ts0[[-1]]];

Here are corresponding rule wrappers indicating the year and the minutes before midnight:

lbls = Map[Row[{#Year, Spacer[3], "\n", IntegerPart[#MinutesToMidnight], Spacer[2], "m", Spacer[2], Round[FractionalPart[#MinutesToMidnight]*60], Spacer[2], "s"}] &, Normal@dsDoomsdayTimes];
lbls = Map[If[#[[1, -3]] == 0, Row@Take[#[[1]], 6], #] &, lbls];

Here the points “known” by the original time series are given callouts:

aRules = Association@MapThread[#1 -> Callout[Tooltip[#1, aDateToDescr[#1[[1]]]], #2] &, {ts0, lbls}];
ts3 = Lookup[aRules, Key[#], #] & /@ ts2;

Finally, here is the plot:

DateListPlot[ts3, 
  PlotStyle -> Directive[{Thickness[0.007`], Orange}],
  Epilog -> {PointSize[0.01`], Black, Point[ts0]}, 
  PlotLabel -> Row[(Style[#1, FontSize -> 16, FontColor -> Black, FontFamily -> "Verdana"] &) /@ {"Doomsday clock: minutes to midnight,", Spacer[3], StringRiffle[MinMax[Normal[dsDoomsdayTimes[All, "Year"]]], "-"]}], 
  FrameLabel -> {"Year", "Minutes to midnight"}, 
  Background -> GrayLevel[0.94`], Frame -> True, 
  FrameTicks -> {{Automatic, (If[#1 == 0, {0, Style["00:00", Red]}, {#1, Row[{"23:", 60 - #1}]}] &) /@ Range[0, 17]}, {Automatic, Automatic}}, GridLines -> {None, All},
  AspectRatio -> 1/3, ImageSize -> 1200
]

Remark: By hovering with the mouse over the black points the corresponding descriptions can be seen. We considered using clock-gauges as tooltips, but showing clock-settings reasons is more informative.

Remark: The plot was intentionally made to resemble the timeline plot in Doomsday Clock’s Wikipedia page.

Conclusion

As expected, parsing, plotting, or otherwise processing the Doomsday Clock settings and statements are excellent didactic subjects for textual analysis (or parsing) and temporal data visualization. The visualization could serve educational purposes or provide insights into historical trends of global threats as perceived by experts. (Remember, the clock’s settings are not just data points but reflections of complex global dynamics.)

One possible application of the code in this notebook is to make a “web service“ that gives clock images with Doomsday Clock readings. For example, click on this button:

Get Doomsday Clock!

Setup

Needs["AntonAntonov`FunctionalParsers`"]

Clear[TextStats];
TextStats[s_String] := AssociationThread[{"Characters", "Words", "Lines"}, Through[{StringLength, Length@*TextWords, Length@StringSplit[#, "\n"] &}[s]]];

BreakStringIntoLines[str_String, maxLength_Integer : 60] := Module[
    {words, lines, currentLine}, 
    words = StringSplit[StringReplace[str, RegularExpression["\\v+"] -> " "]]; 
    lines = {}; 
    currentLine = ""; 
    Do[
       If[StringLength[currentLine] + StringLength[word] + 1 <= maxLength, 
          currentLine = StringJoin[currentLine, If[currentLine === "", "", " "], word], 
          AppendTo[lines, currentLine]; 
          currentLine = word; 
        ], 
       {word, words} 
     ]; 
    AppendTo[lines, currentLine]; 
    StringJoin[Riffle[lines, "\n"]] 
  ]

References

Age at creation for programming languages stats

Posted on May 22, 2024 by Anton Antonov Antonov

Introduction

In this blog post (notebook) we ingest programming languages creation data from “Programming Language DataBase” and visualize several statistics of it.

We do not examine the data source and we do not want to reason too much about the data using the stats. We started this notebook by just wanting to make the bubble charts (both 2D and 3D.) Nevertheless, we are tempted to say and justify statements like:

Pareto holds, as usual.
Language creators tend to do it more than once.
Beware the Second system effect.

References

Here are reference links with explanations and links to dataset files:

The Ages of Programming Language Creators (pldb.io)
- Short note about data and related statistics; provides a link to a TSV file with the data.
The Ages of Programming Language Creators (datawrapper.dwcdn.net)
- “Just a plot”; provides a link to a CSV file with the data.
The Ages of Programming Language Creators (Reddit)
- Link(s) and discussion.

Data ingestion

Here we get the TSC file with Wolfram Function Repository (WFR) function ImportCSVToDataset:

url = "https://pldb.io/posts/age.tsv";
dsData = ResourceFunction["ImportCSVToDataset"][url, "Dataset", "FieldSeparators" -> "\t"];
dsData[[1 ;; 4]]

Here we summarize the data using the WFR function RecordsSummary:

ResourceFunction["RecordsSummary"][dsData, "MaxTallies" -> 12]

Here is a list of languages we use to “get orientated” in the plots below:

lsFocusLangs = {"C++", "Fortran", "Java", "Mathematica", "Perl 6", "Raku", "SQL", "Wolfram Language"};

Here we find the most important tags (used in the plots below):

lsTopTags = ReverseSortBy[Tally[Normal@dsData[All, "tags"]], Last][[1 ;; 7, 1]]

(*{"pl", "textMarkup", "dataNotation", "grammarLanguage", "queryLanguage", "stylesheetLanguage", "protocol"}*)

Here we add the column “group” based on the focus languages and most important tags:

dsData = dsData[All, Append[#, "group" -> Which[MemberQ[lsFocusLangs, #id], "focus", MemberQ[lsTopTags, #tags], #tags, True, "other"]] &];

Distributions

Here are the distributions of the variables/columns:

age at creation
- i.e. “How old was the creator?”
appeared”
- i.e. “In what year the programming language was proclaimed?”

Association @ Map[# -> Histogram[Normal@dsData[All, #], 20, "Probability", Sequence[ImageSize -> Medium, PlotTheme -> "Detailed"]] &, {"ageAtCreation", "appeared"}]

Here are corresponding Box-Whisker plots together with tables of their statistics:

aBWCs = Association@
   Map[# -> BoxWhiskerChart[Normal@dsData[All, #], "Outliers", Sequence[BarOrigin -> Left, ImageSize -> Medium, AspectRatio -> 1/2, PlotRange -> Full]] &, {"ageAtCreation", "appeared"}];

Pareto principle manifestation

Number of creations

Here is the Pareto principle plot of for the number of created (or renamed) programming languages per creator (using the WFR function ParetoPrinciplePlot):

ResourceFunction["ParetoPrinciplePlot"][Association[Rule @@@ Tally[Normal@dsData[All, "creators"]]], ImageSize -> Large]

We can see that ≈25% of the creators correspond to ≈50% of the languages.

Popularity

Obviously, programmers can and do use more than one programming language. Nevertheless, it is interesting to see the Pareto principle plot for the languages “mind share” based on the number of users estimates.

ResourceFunction["ParetoPrinciplePlot"][Normal@dsData[Association, #id -> #numberOfUsersEstimate &], ImageSize -> Large]

Remark: Again, the plot above is “wrong” — programmers use more than one programming language.

Correlations

In order to see meaningful correlation, pairwise plots we take logarithms of the large value columns:

dsDataVar = dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "numberOfJobsEstimate", "rank", "measurements", "pldbScore"}];
dsDataVar = dsDataVar[All, Append[#, <|"numberOfUsersEstimate" -> Log10[#numberOfUsersEstimate + 1], "numberOfJobsEstimate" -> Log10[#numberOfJobsEstimate + 1]|>] &];

Remark: Note that we “cheat” by adding 1 before taking the logarithms.

We obtain the tables of correlations plots using the newly introduced, experimental PairwiseListPlot. If we remove the rows with zeroes some of the correlations become more obvious. Here is the corresponding tab view of the two correlation tables:

TabView[{
   "data" -> PairwiseListPlot[dsDataVar, PlotTheme -> "Business", ImageSize -> 800], 
   "zero-free data" -> PairwiseListPlot[dsDataVar[Select[FreeQ[Values[#], 0] &]], PlotTheme -> "Business", ImageSize -> 800]}]

Remark: Given the names of the data columns and the corresponding obvious interpretations we can say that the stronger correlations make sense.

Bubble chart 2D

In this section we make an informative 2D bubble chart with (tooltips).

First, note that not all triplets of “appeared”,”ageAtCreation”, and “numberOfUsersEstimate” are unique:

ReverseSortBy[Tally[Normal[dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}]]], Last][[1 ;; 3]]

(*{{<|"appeared" -> 2017, "ageAtCreation" -> 33, "numberOfUsersEstimate" -> 420|>, 2}, {<|"appeared" -> 2023, "ageAtCreation" -> 39, "numberOfUsersEstimate" -> 11|>, 1}, {<|"appeared" -> 2022, "ageAtCreation" -> 55, "numberOfUsersEstimate" -> 6265|>, 1}}*)

Hence we make two datasets: (1) one for the core bubble chart, (2) the other for the labeling function:

aData = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}] &];
aData2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the labeling function (see the section “Applications” of the function page of BubbleChart):

Clear[LangLabeler];
LangLabeler[v_, {r_, c_}, ___] := Placed[Grid[{
      {Style[aData2[[r, c]]["id"], Bold, 12], SpanFromLeft}, 
      {"Creator(s):", aData2[[r, c]]["creators"]}, 
      {"Appeared:", aData2[[r, c]]["appeared"]}, 
      {"Age at creation:", aData2[[r, c]]["ageAtCreation"]}, 
      {"Number of users:", aData2[[r, c]]["numberOfUsersEstimate"]} 
     }, Alignment -> Left], Tooltip];

Here is the bubble chart:

BubbleChart[
  aData, 
  FrameLabel -> {"Age at Creation", "Appeared"}, 
  PlotLabel -> "Number of users estimate", 
  BubbleSizes -> {0.05, 0.14}, 
  LabelingFunction -> LangLabeler, 
  AspectRatio -> 1/2.5, 
  ChartStyle -> 7, 
  PlotTheme -> "Detailed", 
  ChartLegends -> {Keys[aData], None}, 
  ImageSize -> 1000 
 ]

Remark: The programming language J is a clear outlier because of creators’ ages.

Bubble chart 3D

In this section we a 3D bubble chart.

As in the previous section we define two datasets: for the core plot and for the tooltips:

aData3D = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate"}] &];
aData3D2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the corresponding labeling function:

Clear[LangLabeler3D];
LangLabeler3D[v_, {r_, c_}, ___] := Placed[Grid[{
      {Style[aData3D2[[r, c]]["id"], Bold, 12], SpanFromLeft}, 
      {"Creator(s):", aData3D2[[r, c]]["creators"]}, 
      {"Appeared:", aData3D2[[r, c]]["appeared"]}, 
      {"Age at creation:", aData3D2[[r, c]]["ageAtCreation"]}, 
      {"Number of users:", aData3D2[[r, c]]["numberOfUsersEstimate"]} 
     }, Alignment -> Left], Tooltip];

Here is the 3D chart:

BubbleChart3D[
  aData3D, 
  AxesLabel -> {"appeared", "ageAtCreation", "measuremnts"}, 
  PlotLabel -> "Number of users estimate", 
  BubbleSizes -> {0.02, 0.07}, 
  LabelingFunction -> LangLabeler3D, 
  BoxRatios -> {1, 1, 1}, 
  ChartStyle -> 7, 
  PlotTheme -> "Detailed", 
  ChartLegends -> {Keys[aData], None}, 
  ImageSize -> 1000 
 ]

Remark: In the 3D bubble chart plot “Mathematica” and “Wolfram Language” are easier to discern.

Second system effect traces

In this section we try — and fail — to demonstrate that the more programming languages a team of creators makes the less successful those languages are. (Maybe, because they are more cumbersome and suffer the Second system effect?)

Remark: This section is mostly made “for fun.” It is not true that each sets of languages per creators team is made of comparable languages. For example, complementary languages can be in the same set. (See, HTTP, HTML, URL.) Some sets are just made of the same language but with different names. (See, Perl 6 and Raku, and Mathematica and Wolfram Language.) Also, older languages would have the First mover advantage.

Make creators to index association:

aCreators = KeySort@Association[Rule @@@ Select[Tally[Normal@dsData[All, "creators"]], #[[2]] > 1 &]];
aNameToIndex = AssociationThread[Keys[aCreators], Range[Length[aCreators]]];

Make a bubble chart with relative popularity per creators team:

aNUsers = Normal@GroupBy[dsData, #creators &, (m = Max[1, Max[Sqrt@KeyTake[#, "numberOfUsersEstimate"]]]; Map[Tooltip[{#appeared, #creators /. aNameToIndex, Sqrt[#numberOfUsersEstimate]/m}, Grid[{{Style[#id, Black, Bold], SpanFromLeft}, {"Creator(s):", #creators}, {"Users:", #numberOfUsersEstimate}}, Alignment -> Left]] &, #]) &];
aNUsers = KeySort@Select[aNUsers, Length[#] > 1 &];
BubbleChart[aNUsers, AspectRatio -> 2, BubbleSizes -> {0.02, 0.05}, ChartLegends -> Keys[aNUsers], ImageSize -> Large, GridLines -> {None, Values[aNameToIndex]}, FrameTicks -> {{Reverse /@ (List @@@ Normal[aNameToIndex]), None}, {Automatic, Automatic}}]

From the plot above we cannot decisively say that:

The most recent creation of a team of programming language creators is not team's most popular creation.

That statement, though, does hold for a fair amount of cases.

Instead of conclusions

Consider:

Making an interactive interface for the variables, types of plots, etc.
Placing callouts for the focus languages in bubble charts.

Cryptocurrencies data explorations

Posted on June 22, 2021 by Anton Antonov Antonov

Introduction

The main goal of this notebook is to provide some basic views and insights into the landscape of cryptocurrencies. The “landscape” we consider consists of price action and trading volume time series for cryptocurrencies found in Yahoo Finance.

Here is the work plan followed in this notebook:

Get cryptocurrency data
Do basic data analysis over suitable date ranges
Gather important cryptocurrency events
Plot together cryptocurrency prices and trading volume time series together with the events
Make observations and conjectures over the plots
Find “global” correlations between the different cryptocurrencies
Find clusters of cryptocurrencies based on time series correlations

Here are some details for the steps above:

The procedure of obtaining the cryptocurrencies data, point 1, is explained in detail in [AA1].
- There is a dedicated resource object CrypocurrencyData that provides cryptocurrency data and related documentation.
The cryptocurrency events data, point 3, is taken from different news sites.
- Links are provided in the corresponding dataset.
The points 6 and 7 follow similar explorations (and code) described in [AA2, AA3].
- Those two articles deal with COVID-19 time series.

Remark: Note that in this notebook we do not discuss philosophical, macro-economic, and environmental issues with cryptocurrencies. We only discuss financial time series data.

Cryptocurrencies data

The cryptocurrencies data used in this notebook is obtained from found in Yahoo Finance . The procedure of obtaining the cryptocurrencies data is explained in detail in [AA1]. There is a dedicated resource object CrypocurrencyData that provides the cryptocurrency data and related documentation.

Here are all cryptocurrencies we have data for:

ResourceFunction["CryptocurrencyData"]["CryptocurrencyNames"]

(*<|"BTC" -> "Bitcoin", "ETH" -> "Ethereum", "USDT" -> "Tether", "BNB" -> "BinanceCoin", "ADA" -> "Cardano", "XRP" -> "XRP", "USDC" -> "Coin", "DOGE" -> "Dogecoin", "DOT1" -> "Polkadot", "HEX" -> "HEX", "UNI3" -> "Uniswap", "BCH" -> "BitcoinCash", "LTC" -> "Litecoin", "LINK" -> "Chainlink", "SOL1" -> "Solana", "MATIC" -> "MaticNetwork", "THETA" -> "THETA", "XLM" -> "Stellar", "VET" -> "VeChain", "ICP1" -> "InternetComputer", "ETC" -> "EthereumClassic", "TRX" -> "TRON", "FIL" -> "FilecoinFutures", "XMR" -> "Monero", "EOS" -> "EOS"|>*)

Remark: FinancialData is “aware” of 10 cryptocurrencies, but that is not documented (as far as I can tell) and only prices are provided. (For more details see the discussion in CrypocurrencyData.) Here are examples:

Row[DateListPlot[FinancialData[#, "Jan 1 2021"], ImageSize -> Medium, AspectRatio -> 1/4, PlotLabel -> #] & /@ {"BTC", "ETH"}]

Significant cryptocurrencies

In this section we analyze the summaries of cryptocurrencies data in order to derive a list of the most significant ones.

We choose the phrase “significant cryptocurrency” to mean “a cryptocurrency with high market capitalization, price, or trading volume.”

Together with the summaries we look into the Pareto principle adherence of the corresponding values.

Remark: The Pareto principle adherence should be interpreted carefully here – the cryptocurrencies are not mutually exclusive when in comes to money invested and trading volumes. Nevertheless, we can interpret the corresponding value ratios as indicators of “mind share” or “significance.”

By summaries

Here is a summary of the cryptocurrencies we consider (from Yahoo Finance) ordered by “Market Cap” (largest first):

dsCCSummary = ResourceFunction["CryptocurrencyData"][All, "Summary"]

Here is the summary of summary dataset above:

ResourceFunction["RecordsSummary"][dsCCSummary]

Here is a Pareto principle adherence plot for the cryptocurrency market caps:

aMCaps = Normal[dsCCSummary[Association, StringSplit[#Symbol, "-"][[1]] -> #["Market Cap"] &]]; ResourceFunction["ParetoPrinciplePlot"][aMCaps, PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency market caps"]

Here is the Pareto statistic for the top 12 cryptocurrencies:

Take[AssociationThread[Keys@aMCaps, Accumulate[Values@aMCaps]]/Total[aMCaps], 12]

(*<|"BTC" -> 0.521221, "ETH" -> 0.71188, "USDT" -> 0.765931, "BNB" -> 0.800902, "ADA" -> 0.833777, "XRP" -> 0.856467, "USDC" -> 0.878274, "DOGE" -> 0.899587, "DOT1" -> 0.9121, "HEX" -> 0.924055, "UNI3" -> 0.932218, "BCH" -> 0.939346|>*)

By price

Get the mean daily closing prices data for the last two weeks and show the corresponding data summary:

startDate = DatePlus[Now, -Quantity[2, "Weeks"]]; aMeans = ReverseSort[Association[# -> Mean[ResourceFunction["CryptocurrencyData"][#, "Close", startDate]["Values"]] & /@ ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]]];
ResourceFunction["RecordsSummary"][aMeans, Thread -> True]

Pareto principle adherence plot:

ResourceFunction["ParetoPrinciplePlot"][aMeans, PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency closing prices"]

Here are the Pareto statistic values for the top 12 cryptocurrencies:

aCCTop = Take[AssociationThread[Keys@aMeans, Accumulate[Values@aMeans]]/Total[aMeans], 12]

(*<|"BTC" -> 0.902595, "ETH" -> 0.959915, "BCH" -> 0.974031, "BNB" -> 0.982414, "XMR" -> 0.988689, "LTC" -> 0.992604, "FIL" -> 0.99426, "ICP1" -> 0.995683, "ETC" -> 0.997004, "SOL1" -> 0.997906, "LINK" -> 0.998449, "UNI3" -> 0.998987|>*)

Plot the daily closing prices of top cryptocurrencies since January 2018:

DateListPlot[Log10 /@ Association[# -> ResourceFunction["CryptocurrencyData"][#, "Close", "Jan 1, 2018"] & /@ Keys[aCCTop]], PlotLabel -> "lg of crytocurrencies daily closing prices, USD", PlotTheme -> "Detailed", PlotRange -> All]

By trading volume

Get the mean daily trading volumes data for the last two weeks and show the corresponding data summary:

startDate = DatePlus[Now, -Quantity[2, "Weeks"]]; aMeans = ReverseSort[Association[# -> Mean[ResourceFunction["CryptocurrencyData"][#, "Volume", startDate]["Values"]] & /@ ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]]];
ResourceFunction["RecordsSummary"][aMeans, Thread -> True]

Pareto principle adherence plot:

ResourceFunction["ParetoPrinciplePlot"][aMeans, PlotRange -> {0, 1.1},PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency trading volumes"]

Here are the Pareto statistic values for the top 12 cryptocurrencies:

aCCTop = N@Take[AssociationThread[Keys@aMeans, Accumulate[Values@aMeans]]/Total[aMeans], 12]

(*<|"USDT" -> 0.405697, "BTC" -> 0.657918, "ETH" -> 0.817959, "XRP" -> 0.836729, "ADA" -> 0.853317, "ETC" -> 0.868084, "LTC" -> 0.882358, "DOGE" -> 0.896621, "BNB" -> 0.910013, "USDC" -> 0.923379, "BCH" -> 0.933938, "DOT1" -> 0.944249|>*)

Plot the daily closing prices of top cryptocurrencies since January 2018:

DateListPlot[Log10 /@ Association[# -> ResourceFunction["CryptocurrencyData"][#, "Volume", "Jan 1, 2018"] & /@ Keys[aCCTop]], PlotLabel -> "lg of cryptocurrencies trading volumes", PlotTheme -> "Detailed", PlotRange -> {5, Automatic}]

In this section we make a dataset that has the dates of certain cryptocurrency related events and links to their news announcements.

The events were taken by observing cryptocurrency board stories in the news aggregation site slashdot.org.

lsEventData = {
    {"Jun 18, 2021", "China to shut down over 90% of its Bitcoin mining capacity after local bans", "https://www.globaltimes.cn/page/202106/1226598.shtml"}, 
    {"Jun 10, 2021", "Global banking regulators call for toughest rules for cryptocurrencies", "https://www.theguardian.com/technology/2021/jun/10/global-banking-regulators-cryptocurrencies-bitcoin"}, 
    {"June 10, 2021", "IMF sees legal, economic issues with El Salvador's bitcoin move","https://www.reuters.com/business/finance/imf-sees-legal-economic-issues-with-el-salvador-bitcoin-move-2021-06-10/"}, 
    {"June 8, 2021", "El Salvador Becomes First Country To Adopt Bitcoin as Legal Tender After Passing Law", "https://www.cnbc.com/2021/06/09/el-salvador-proposes-law-to-make-bitcoin-legal-tender.html"}, 
    {"June 8, 2021", "US recovers millions in cryptocurrency paid to Colonial Pipeline ransomware hackers", "https://edition.cnn.com/2021/06/07/politics/colonial-pipeline-ransomware-recovered/"}, 
    {"June 4, 2021", "Start of Bitcoin 2021: World\[CloseCurlyQuote]s Largest Cryptocurrency Conference Coming To Wynwood", "https://miami.cbslocal.com/2021/06/04/bitcoin-2021-worlds-largest-cryptocurrency-conference-coming-to-wynwood/"}, 
    {"June 6, 2021", "End of Bitcoin 2021: World\[CloseCurlyQuote]s Largest Cryptocurrency Conference Coming To Wynwood", "https://miami.cbslocal.com/2021/06/04/bitcoin-2021-worlds-largest-cryptocurrency-conference-coming-to-wynwood/"}, 
    {"May 28, 2021", "Iran Bans Crypto Mining After Months of Blackouts", "https://gizmodo.com/iran-bans-crypto-mining-after-months-of-blackouts-1846991039"}, 
    {"May 19, 2021", "Bitcoin, Ethereum prices in free fall as China plans crackdown on mining and trading", "https://www.cnet.com/news/bitcoin-ethereum-prices-in-freefall-as-china-plans-crackdown-on-mining-and-trading/#ftag=CAD590a51e"} 
   };
dsEventData = Dataset[lsEventData][All, AssociationThread[{"Date", "Event", "URL"}, #] &];
dsEventData = dsEventData[All, Join[Prepend[#, "DateObject" -> DateObject[#Date]], <|"URL" -> URL[#URL]|>] &];
dsEventData = dsEventData[SortBy[#DateObject &]]

Cryptocurrency time series with events

In this section we discuss possible correlation and causation effects of reported cryptocurrency events.

Remark: The discussion is based on time series and events only, without considering other operational properties of the cryptocurrencies.

Here is a date range:

dateRange = {"May 15 2021", "Jun 21 2021"};

Here get time series for the daily opening and closing prices for the selected date range:

aBTCPrices = ResourceFunction["CryptocurrencyData"]["BTC", {"Open", "Close"}, dateRange];
aETHPrices = ResourceFunction["CryptocurrencyData"]["ETH", {"Open", "Close"}, dateRange];
aCCVolume = ResourceFunction["CryptocurrencyData"][{"BTC", "ETH"}, "Volume", dateRange];

Here are the summaries for prices:

ResourceFunction["GridTableForm"][Map[ResourceFunction["RecordsSummary"][#["Values"], "USD"] &, #] & /@ <|"BTC" -> aBTCPrices, "ETH" -> aETHPrices|>]

Here are the summaries for trading volumes:

ResourceFunction["RecordsSummary"][#["Values"], "USD"] & /@ aCCVolume

Here we plot the cryptocurrency events with together with the Bitcoin (BTC) price time series:

CryptocurrencyPlot[{aBTCPrices, dsEventData}, PlotLabel -> "BTC daily prices", ImageSize -> 1200]

Here we plot the cryptocurrency events with together with the Ether (ETH) price time series:

CryptocurrencyPlot[{aETHPrices, dsEventData}, PlotRange -> {0.95, 1.05} MinMax[aETHPrices[[1]]["Values"]], PlotLabel -> "BTC daily prices", ImageSize -> 1200]

Here we plot the cryptocurrency events with together with the BTC trading volume time series:

CryptocurrencyPlot[{aCCVolume, dsEventData}, PlotLabel -> "BTC and ETH trading volumes", ImageSize -> 1200]

Observations

Going down

We can see that opening prices and volume going down correlate with:

The news announcement that China plans to crackdown on mining and trading
The news announcement Iran bans crypto mining
The Sichuan Provincial Development and Reform Commission and the Sichuan Energy Bureau issue of a joint notice, ordering local electricity companies to “screen, clean up and terminate” mining operations
The start of the “Bitcoin 2021” conference

Related conjectures:

We can easily conjecture that 1 and 2 made cryptocurrencies (Bitcoin) less attractive to miners or traders in China and Iran, hence the price and the volume went down.
The most active Bitcoin traders were attending the “Bitcoin 2021” conference, hence the price and volume went down.

Going up

We can see the prices and volume going up correlate with:

The news announcement of El Salvador adopting BTC as legal tender currency
The news announcement that US Justice Department recovered most of the ransom paid to the Colonial Pipeline hackers
The end of the “Bitcoin 2021” conference

Related conjectures:

Of course, a country deciding to use BTC as legal tender would make (some) traders willing to invest in BTC.
The announcement that USA Justice Department, have made (some) traders to more confidently invest in BTC.
- Although, the opposite could also happen – for some people if BTC can be recovered by law enforcement, then BTC is less attractive for financial transactions.
After the end of “Bitcoin 2021” conference the attending traders resumed their usual activity.
- That conjecture and the “start of Bitcoin 2021” conjecture above support each other.
- The same pattern is observed for both BTC and ETH trading volumes.

Time series correlations

In this section we compute and visualize correlations between the time series of a set of cryptocurrencies.

Getting time series data

Here are the cryptocurrencies we consider:

lsCCFocus = ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]

(*{"ADA", "BCH", "BNB", "BTC", "DOGE", "DOT1", "EOS", "ETC", "ETH", "FIL", "HEX", "ICP1", "LINK", "LTC", "MATIC", "SOL1", "THETA", "TRX", "UNI3", "USDC", "USDT", "VET", "XLM", "XMR", "XRP"}*)

The start date we use is the one that was 90 days ago:

startDate = DatePlus[Date[], -Quantity[90, "Days"]]

(*{2021, 3, 24, 13, 24, 42.303}*)

aTSOpen = ResourceFunction["CryptocurrencyData"][lsCCFocus, "Open", startDate];
aTSVolume = ResourceFunction["CryptocurrencyData"][lsCCFocus, "Volume", startDate];

dateRange = {startDate, Date[]};
aTSOpen2 = Quiet@TimeSeriesResample[#, Append[dateRange, "Day"]] & /@ aTSOpen;
aTSVolume2 = Quiet@TimeSeriesResample[#, Append[dateRange, "Day"]] & /@ aTSVolume;

Opening price time series

Show heat-map plot corresponding to the max-normalized time series with clustering:

matVals = Association["SparseMatrix" -> SparseArray[Values@Map[#["Values"]/Max[#["Values"]] &, aTSOpen2]],"RowNames" -> Keys[aTSOpen2], "ColumnNames" -> Range[Length[aTSOpen2[[1]]["Times"]]]];
HeatmapPlot[Map[# /. x_Association :> Keys[x] &, matVals], Dendrogram -> {True, False}, DistanceFunction -> {CosineDistance, None}, ImageSize -> 1200]

Derive correlation triplets using SpearmanRho :

lsCorTriplets = Flatten[Outer[{#1, #2, SpearmanRho[aTSOpen2[#1]["Values"], aTSOpen2[#2]["Values"]]} &, Keys@aTSOpen2, Keys@aTSOpen2], 1];
dsCorTriplets = Dataset[lsCorTriplets][All, AssociationThread[{"TS1", "TS2", "Correlation"}, #] &];
dsCorTriplets = dsCorTriplets[Select[#TS1 != #TS2 &]];

Show summary of the correlation triplets:

ResourceFunction["RecordsSummary"][dsCorTriplets]

Show correlations that too high or too low:

Dataset[Union[Normal@dsCorTriplets[Select[Abs[#Correlation] > 0.85 &]], "SameTest" -> (Sort[Values@#1] == Sort[Values@#2] &)]][ReverseSortBy[#Correlation &]]

Cross tabulate the correlation triplets and show the corresponding dataset:

dsMatCor = ResourceFunction["CrossTabulate"][dsCorTriplets]

Cross tabulate the correlation triplets and plot the corresponding matrix with heat-map plot:

matCor1 = ResourceFunction["CrossTabulate"][dsCorTriplets, "Sparse" -> True];
gr1 = HeatmapPlot[matCor1, Dendrogram -> {True, True}, DistanceFunction -> {CosineDistance, CosineDistance}, ImageSize -> Medium, PlotLabel -> "Opening price"]

Trading volume time series

Show heat-map plot corresponding to the max-normalized time series with clustering:

matVals = Association["SparseMatrix" -> SparseArray[Values@Map[#["Values"]/Max[#["Values"]] &, aTSVolume2]], "RowNames" -> Keys[aTSOpen2], "ColumnNames" -> Range[Length[aTSVolume2[[1]]["Times"]]]];
HeatmapPlot[Map[# /. x_Association :> Keys[x] &, matVals], Dendrogram -> {True, False}, DistanceFunction -> {CosineDistance, None}, ImageSize -> 1200]

Derive correlation triplets using SpearmanRho :

lsCorTriplets = Flatten[Outer[{#1, #2, SpearmanRho[aTSVolume2[#1]["Values"], aTSVolume2[#2]["Values"]]} &, Keys@aTSVolume2, Keys@aTSVolume2], 1];
dsCorTriplets = Dataset[lsCorTriplets][All, AssociationThread[{"TS1", "TS2", "Correlation"}, #] &];
dsCorTriplets = dsCorTriplets[Select[#TS1 != #TS2 &]];

Show summary of the correlation triplets:

ResourceFunction["RecordsSummary"][dsCorTriplets]

Show correlations that too high or too low:

Dataset[Union[Normal@dsCorTriplets[Select[Abs[#Correlation] > 0.85 &]], "SameTest" -> (Sort[Values@#1] == Sort[Values@#2] &)]][ReverseSortBy[#Correlation &]]

Cross tabulate the correlation triplets and show the corresponding dataset:

dsMatCor = ResourceFunction["CrossTabulate"][dsCorTriplets]

Cross tabulate the correlation triplets and plot the corresponding matrix with heat-map plot:

matCor2 = ResourceFunction["CrossTabulate"][dsCorTriplets, "Sparse" -> True];
gr2 = HeatmapPlot[matCor2, Dendrogram -> {True, True}, DistanceFunction -> {CosineDistance, CosineDistance}, ImageSize -> Medium, PlotLabel -> "Trading volume"]

Observations

Here are the correlation matrix plots above placed next to each other:

Row[{gr1, gr2}]

Generally speaking, the two clustering patterns are different. This is one of the reasons to do the nearest neighbor graph clusterings below.

Nearest neighbors graphs

In this section we create nearest neighbor graphs of the correlation matrices computed above and plot clusterings of the nodes.

Graphs overview

Here we create the nearest neighbor graphs:

aNNGraphsVertexRules = Association@MapThread[#2 -> Association[Thread[Rule[Normal[Transpose[#SparseMatrix]], #ColumnNames]]] &, {{matCor1, matCor2}, {"Open", "Volume"}}];

aNNGraphs = Association@MapThread[(gr = NearestNeighborGraph[Normal[Transpose[#SparseMatrix]], 4, GraphLayout -> "SpringEmbedding", VertexLabels -> Normal[aNNGraphsVertexRules[#2]]]; #2 -> Graph[EdgeList[gr], VertexLabels -> Normal[aNNGraphsVertexRules[#2]], ImageSize -> Large]) &, {{matCor1, matCor2}, {"Open", "Volume"}}];

Here we plot the graphs with clusters:

ResourceFunction["GridTableForm"][List @@@ Normal[CommunityGraphPlot[#, ImageSize -> 800] & /@ aNNGraphs], TableHeadings -> {"Property", "Communities of nearest neighbors graph"}, Background -> White, Dividers -> All]

Here are the corresponding time series plots for each cluster:

aClusterPlots = 
   Association@Map[
     Function[{prop}, 
      prop -> Map[
        DateListPlot[Log10 /@ ResourceFunction["CryptocurrencyData"][#, prop, dateRange]] &, 
        FindGraphCommunities[aNNGraphs[prop]] /. aNNGraphsVertexRules[prop]] 
     ], 
     Keys[aNNGraphs] 
    ];

ResourceFunction["GridTableForm"][List @@@ Normal[aClusterPlots], TableHeadings -> {"Property", "Cluster plots"}, Background -> White, Dividers -> All]

Other types of analysis

I investigated the data with several other methods:

Clustering with different methods and distance functions
Clustering after the application of Independent Component Analysis (ICA), [AAw5]
Time series analysis with Quantile Regression (QR), [AAw6]

None of the outcomes provided some “immediate”, notable insight. The analyses with ICA and QR, though, seem to provide some interesting and fruitful future explorations.

Load packages

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/SSparseMatrix.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Definitions

Clear[CryptocurrencyPlot];
CryptocurrencyPlot[{aCryptoCurrenciesData_Association, dsEventData_Dataset}, opts : OptionsPattern[]] := 
   Block[{aEventDateObject, aEventURL, aEventRank, grGrid, lsVals}, 
    
    aEventDateObject = Normal@dsEventData[Association, {#Event -> AbsoluteTime[#DateObject]} &]; 
    aEventURL = Normal@dsEventData[Association, {#Event -> Button[Mouseover[Style[#Event, Gray, FontSize -> 10], Style[#Event, Pink, FontSize -> 10]], NotebookLocate[{#URL, None}], Appearance -> None]} &]; aEventRank = Block[{k = 1}, Normal@dsEventData[Association, {#Event -> (k++)/Length[dsEventData]} &]]; 
    
    lsVals = Flatten@Map[#["Values"] &, Values@aCryptoCurrenciesData];
    grGrid = 
     DateListPlot[
      KeyValueMap[Callout[{#2, Rescale[aEventRank[#1], {0, 1}, MinMax[lsVals]]}, aEventURL[#1], Right] &, Sort@aEventDateObject], 
      PlotStyle -> {Gray, Opacity[0.3], PointSize[0.0035]}, 
      Joined -> False, 
      GridLines -> {Sort@Values[aEventDateObject], None} 
     ]; 
    Show[
     DateListPlot[
      aCryptoCurrenciesData, 
      opts, 
      GridLines -> {Sort@Values[aEventDateObject], None}, 
      PlotRange -> All, 
      AspectRatio -> 1/4, 
      ImageSize -> Large 
     ], 
     grGrid 
    ] 
   ];
CryptocurrencyPlot[___] := $Failed;

References

Articles

[AA1] Anton Antonov, “Crypto-currencies data acquisition with visualization”, (2021), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub.

[AA3] Anton Antonov, “Apple mobility trends data visualization”, (2020), SystemModeling at GitHub.

Packages

[AAp1] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

Resource functions

[AAw1] Anton Antonov, CryptocurrencyData, (2021).

[AAw2] Anton Antonov, RecordsSummary, (2019).

[AAw3] Anton Antonov, ParetoPrinciplePlot, (2019).

[AAw4] Anton Antonov, CrossTabulate, (2019).

[AAw5] Anton Antonov, IndependentComponentAnalysis, (2019).

[AAw6] Anton Antonov, QuantileRegression, (2019).

Time series search engines over COVID-19 data

Posted on January 27, 2021 by Anton Antonov Antonov

Introduction

In this article we proclaim the preparation and availability of interactive interfaces to two Time Series Search Engines (TSSEs) over COVID-19 data. One TSSE is based on Apple Mobility Trends data, [APPL1]; the other on The New York Times COVID-19 data, [NYT1].

Here are links to interactive interfaces of the TSSEs hosted (and publicly available) at shinyapps.io by RStudio:

Motivation: The primary motivation for making the TSSEs and their interactive interfaces is to use them as exploratory tools. Combined with relevant data analysis (e.g. [AA1, AA2]) the TSSEs should help to form better intuition and feel of the spread of COVID-19 and related data aggregation, public reactions, and government polices.

The rest of the article is structured as follows:

Brief descriptions the overall process, the data
Brief descriptions the search engines structure and implementation
Discussions of a few search examples and their (possible) interpretations

The overall process

For both search engines the overall process has the same steps:

Ingest the data
Do basic (and advanced) data analysis
Make (and publish) reports detailing the data ingestion and transformation steps
Enhance the data with transformed versions of it or with additional related data
Make a Time Series Sparse Matrix Recommender (TSSMR)
Make a Time Series Search Engine Interactive Interface (TSSEII)
Make the interactive interface easily accessible over the World Wide Web

Here is a flow chart that corresponds to the steps listed above:

Data

The Apple data

The Apple Mobility Trends data is taken from Apple’s site, see [APPL1]. The data ingestion, basic data analysis, time series seasonality demonstration, (graph) clusterings are given in [AA1]. (Here is a link to the corresponding R-notebook .)

The weather data was taken using the Mathematica function WeatherData, [WRI1].

(It was too much work to get the weather data using some of the well known weather data R packages.)

The New York Times data

The New York Times COVID-19 data is taken from GitHub, see [NYT1]. The data ingestion, basic data analysis, and visualizations are given in [AA2]. (Here is a link to the corresponding R-notebook .)

The search engines

The following sub-sections have screenshots of the TSSE interactive interfaces.

I did experiment with combining the data of the two engines, but did not turn out to be particularly useful. It seems that is more interesting and useful to enhance the Apple data engine with temperature data, and to enhance The New Your Times engine with the (consecutive) differences of the time series.

Structure

The interactive interfaces have three panels:

Nearest Neighbors
- Gives the time series nearest neighbors for the time series of selected entity.
- Has interactive controls for entity selection and filtering.
Trend Finding
- Gives the time series that adhere to a specified named trend.
- Has interactive controls for trend curves selection and entity filtering.
Notes
- Gives references and data objects summary.

Implementation

Both TSSEs are implemented using the R packages “SparseMatrixRecommender”, [AAp1], and “SparseMatrixRecommenderInterfaces”, [AAp2].

The package “SparseMatrixRecommender” provides functions to create and use Sparse Matrix Recommender (SMR) objects. Both TSSEs use underlying SMR objects.

The package “SparseMatrixRecommenderInterfaces” provides functions to generate the server and client functions for the Shiny framework by RStudio.

As it was mentioned above, both TSSEs are published at shinyapps.io. The corresponding source codes can be found in [AAr1].

Apple Mobility Trends Reports Search Engine

The Apple data TSSE has four types of time series (“entities”). The first three are normalized volumes of Apple maps requests while driving, transit transport use, and walking. (See [AA1] for more details.) The fourth is daily mean temperature at different geo-locations.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

The New York Times COVID-19 Data Search Engine

The New York Times TSSE has four types of time series (aggregated) cases and deaths, and their corresponding time series differences.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

Examples

In this section we discuss in some detail several examples of using each of the TSSEs.

Apple data search engine examples

Here are a few observations from [AA1]:

The COVID-19 lockdowns are clearly reflected in the time series.
The time series from the Apple Mobility Trends data shows strong weekly seasonality. Roughly speaking, people go to places they are not familiar with on Fridays and Saturdays. Other work week days people are more familiar with their trips. Since much lesser number of requests are made on Sundays, we can conjecture that many people stay at home or visit very familiar locations.

Here are a few assumptions:

Where people frequently go (work, school, groceries shopping, etc.) they do not need directions that much.
People request directions when they have more free time and will for “leisure trips.”
During vacations people are more likely to be in places they are less familiar with.
People are more likely to take leisure trips when the weather is good. (Warm, not raining, etc.)

Nice, France vs Florida, USA

Consider the results of the Nearest Neighbors panel for Nice, France.

Since French tend to go on vacation in July and August ([SS1, INSEE1]) we can see that driving, transit, and walking in Nice have pronounced peaks during that time:

Of course, we also observe the lockdown period in that geographical area.

Compare those time series with the time series from driving in Florida, USA:

We can see that people in Florida, USA have driving patterns unrelated to the typical weather seasons and vacation periods.

(Further TSSE queries show that there is a negative correlation with the temperature in south Florida and the volumes of Apple Maps directions requests.)

Italy and Balkan countries driving

We can see that according to the data people who have access to both iPhones and cars in Italy and the Balkan countries Bulgaria, Greece, and Romania have similar directions requests patterns:

(The similarities can be explained with at least a few “obvious” facts, but we are going to restrain ourselves.)

The New York Times data search engine examples

In Broward county, Florida, USA and Cook county, Illinois, USA we can see two waves of infections in the difference time series:

References

Data

[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research (2008), WeatherData, Wolfram Language function.

Articles

[AA1] Anton Antonov, “Apple mobility trends data visualization (for COVID-19)”, (2020), SystemModeling at GitHub/antononcube.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub/antononcube.

[INSEE1] Institut national de la statistique et des études économiques, “En 2010, les salariés ont pris en moyenne six semaines de congé”, (2012).

[SS1] Sam Schechner and Lee Harris, “What Happens When All of France Takes Vacation? 438 Miles of Traffic”, (2019), The Wall Street Journal

Packages, repositories

[AAp1] Anton Antonov, Sparse Matrix Recommender framework functions, (2019), R-packages at GitHub/antononcube.

[AAp2] Anton Antonov, Sparse Matrix Recommender framework interface functions, (2019), R-packages at GitHub/antononcube.

[AAr1] Anton Antonov, Coronavirus propagation dynamics, (2020), SystemModeling at GitHub/antononcube.

NY Times COVID-19 data visualization (Update)

Posted on January 15, 2021 by Anton Antonov Antonov

Introduction

This post is both an update and a full-blown version of an older post — “NY Times COVID-19 data visualization” — using NY Times COVID-19 data up to 2021-01-13.

The purpose of this document/notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].

The following steps are taken:

Ingest data
- Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
- Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
Merge the data.
Make data summaries and related plots.
Make corresponding geo-plots.
Do “out of the box” time series forecast.
Analyze fluctuations around time series trends.

Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].

Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.

Import data

NYTimes USA states data

dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"];
dsNYDataStates = dsNYDataStates[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsNYDataStates]

NYTimes USA counties data

dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"];
dsNYDataCounties = dsNYDataCounties[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsNYDataCounties]

US county records

dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];
dsUSACountyData = dsUSACountyData[All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsUSACountyData]

Merge data

Verify that the two datasets have common FIPS codes:

Length[Intersection[Normal[dsUSACountyData[All, "FIPS"]], Normal[dsNYDataCounties[All, "Fips"]]]]

(*3133*)

Merge the datasets:

dsNYDataCountiesExtended = Dataset[JoinAcross[Normal[dsNYDataCounties], Normal[dsUSACountyData[All, {"FIPS", "Lat", "Lon", "Population"}]], Key["Fips"] -> Key["FIPS"]]];

Add a “DateObject” column and (reverse) sort by date:

dsNYDataCountiesExtended = dsNYDataCountiesExtended[All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];
dsNYDataCountiesExtended[[1 ;; 6]]

Basic data analysis

We consider cases and deaths for the last date only. (The queries can be easily adjusted for other dates.)

dfQuery = dsNYDataCountiesExtended[Select[#Date == dsNYDataCountiesExtended[1, "Date"] &], {"FIPS", "Cases", "Deaths"}];
dfQuery = dfQuery[All, Prepend[#, "FIPS" -> ToString[#FIPS]] &];

Total[dfQuery[All, {"Cases", "Deaths"}]]

(*<|"Cases" -> 22387340, "Deaths" -> 355736|>*)

Here is the summary of the values of cases and deaths across the different USA counties:

ResourceFunction["RecordsSummary"][dfQuery]

The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:

opts = {PlotRange -> All, ImageSize -> Medium};
Rasterize[Grid[
   Function[{columnName}, 
     {Histogram[Log10[#], PlotLabel -> Row[{Log10, Spacer[3], columnName}], opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[dfQuery[All, columnName]] 
    ] /@ {"Cases", "Deaths"}, 
   Dividers -> All, FrameStyle -> GrayLevel[0.7]]]

A couple of observations:

The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
Typical manifestation of the Pareto principle: 80% of the cases and deaths are registered in 20% of the counties.

Remark: The top 20% counties of the cases are not necessarily the same as the top 20% counties of the deaths.

Distributions

Here we find the distributions that correspond to the cases and deaths (using FindDistribution ):

ResourceFunction["GridTableForm"][List @@@ Map[Function[{columnName}, 
     columnName -> FindDistribution[N@Log10[Select[#, # > 0 &]]] &@Normal[dfQuery[All, columnName]] 
    ], {"Cases", "Deaths"}], TableHeadings -> {"Data", "Distribution"}]

Pareto principle locations

The following query finds the intersection between that for the top 600 Pareto principle locations for the cases and deaths:

Length[Intersection @@ Map[Function[{columnName}, Keys[TakeLargest[Normal@dfQuery[Association, #FIPS -> #[columnName] &], 600]]], {"Cases", "Deaths"}]]

(*516*)

Geo-histogram

lsAllDates = Union[Normal[dsNYDataCountiesExtended[All, "Date"]]];
lsAllDates // Length

(*359*)

Manipulate[
  DynamicModule[{ds = dsNYDataCountiesExtended[Select[#Date == datePick &]]}, 
   GeoHistogram[
    Normal[ds[All, {"Lat", "Lon"}][All, Values]] -> N[Normal[ds[All, columnName]]], 
    Quantity[150, "Miles"], PlotLabel -> columnName, PlotLegends -> Automatic, ImageSize -> Large, GeoProjection -> "Equirectangular"] 
  ], 
  {{columnName, "Cases", "Data type:"}, {"Cases", "Deaths"}}, 
  {{datePick, Last[lsAllDates], "Date:"}, lsAllDates}]

Heat-map plots

An alternative of the geo-visualization is to use a heat-map plot. Here we use the package “HeatmapPlot.m”, [AAp1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Cases

Cross-tabulate states with dates over cases:

matSDC = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Cases"}], "Sparse" -> True];

Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to states):

HeatmapPlot[matSDC, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Deaths

Cross-tabulate states with dates over deaths and plot:

matSDD = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Deaths"}], "Sparse" -> True];
HeatmapPlot[matSDD, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Time series analysis

Cases

Time series

For each date sum all cases over the states, make a time series, and plot it:

tsCases = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Cases & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsCases, PlotLabel -> "Cases", opts]

ResourceFunction["RecordsSummary"][tsCases["Path"]]

Logarithmic plot:

DateListPlot[Log10[tsCases], PlotLabel -> Row[{Log10, Spacer[3], "Cases"}], opts]

“Forecast”

Fit a time series model to log 10 of the time series:

tsm = TimeSeriesModelFit[Log10[tsCases]]

Plot log 10 data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Plot data and forecast:

DateListPlot[{tsCases, 10^TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Deaths

Time series

For each date sum all cases over the states, make a time series, and plot it:

tsDeaths = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Deaths & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsDeaths, PlotLabel -> "Deaths", opts]

ResourceFunction["RecordsSummary"][tsDeaths["Path"]]

“Forecast”

Fit a time series model:

tsm = TimeSeriesModelFit[tsDeaths, "ARMA"]

Plot data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Fluctuations

We want to see does the time series data have fluctuations around its trends and estimate the distributions of those fluctuations. (Knowing those distributions some further studies can be done.)

This can be efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

Fluctuations presence

Here we plot the consecutive differences of the cases:

DateListPlot[Differences[tsCases], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

Here we plot the consecutive differences of the deaths:

DateListPlot[Differences[tsDeaths], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.

Absolute and relative errors distributions

Here we take interesting part of the cases data:

tsData = TimeSeriesWindow[tsCases, {{2020, 5, 1}, {2020, 12, 31}}];

Here we specify QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:

qrObj = 
   QRMonUnit[tsData]⟹
    QRMonEchoDataSummary⟹
    QRMonRescale[Axes -> {False, True}]⟹
    QRMonEchoDataSummary⟹
    QRMonQuantileRegression[16, 0.5]⟹
    QRMonSetRegressionFunctionsPlotOptions[{PlotStyle -> Red}]⟹
    QRMonDateListPlot[AspectRatio -> 1/4, ImageSize -> Large]⟹
    QRMonErrorPlots["RelativeErrors" -> False, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]⟹
    QRMonErrorPlots["RelativeErrors" -> True, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True];

Here we find the distribution of the absolute errors (fluctuations) using FindDistribution:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*CauchyDistribution[6.0799*10^-6, 0.000331709]*)

Absolute errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*ExtremeValueDistribution[-0.000996315, 0.00207593]*)

Here we find the distribution of the of the relative errors:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*StudentTDistribution[0.0000511326, 0.00244023, 1.59364]*)

Relative errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*NormalDistribution[9.66949*10^-6, 0.00394395]*)

References

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.

[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.

[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediciton at GitHub.

Apple mobility trends data visualization (for COVID-19)

Posted on April 17, 2020 by Anton Antonov Antonov

Introduction

I this notebook/document we ingest and visualize the mobility trends data provided by Apple, [APPL1].

We take the following steps:

Download the data
Import the data and summarise it
Transform the data into long form
Partition the data into subsets that correspond to combinations of geographical regions and transportation types
Make contingency matrices and corresponding heat-map plots
Make nearest neighbors graphs over the contingency matrices and plot communities
Plot the corresponding time series

Data description

From Apple’s page https://www.apple.com/covid19/mobility

About This Data The CSV file and charts on this site show a relative volume of directions requests per country/region or city compared to a baseline volume on January 13th, 2020. We define our day as midnight-to-midnight, Pacific time. Cities represent usage in greater metropolitan areas and are stably defined during this period. In many countries/regions and cities, relative volume has increased since January 13th, consistent with normal, seasonal usage of Apple Maps. Day of week effects are important to normalize as you use this data. Data that is sent from users’ devices to the Maps service is associated with random, rotating identifiers so Apple doesn’t have a profile of your movements and searches. Apple Maps has no demographic information about our users, so we can’t make any statements about the representativeness of our usage against the overall population.

Observations

The observations listed in this subsection are also placed under the relevant statistics in the following sections and indicated with “Observation”.

The directions request volumes reference date for normalization is 2020-01-13 : all the values in that column are 100.
From the community clusters of the nearest neighbor graphs (derived from the time series of the normalized driving directions requests volume) we see that countries and cities are clustered in expected ways. For example, in the community graph plot corresponding to “{city, driving}” the cities Oslo, Copenhagen, Helsinki, Stockholm, and Zurich are placed in the same cluster. In the graphs corresponding to “{city, transit}” and “{city, walking}” the Japanese cities Tokyo, Osaka, Nagoya, and Fukuoka are clustered together.
In the time series plots the Sundays are indicated with orange dashed lines. We can see that from Monday to Thursday people are more familiar with their trips than say on Fridays and Saturdays. We can also see that on Sundays people (on average) are more familiar with their trips or simply travel less.

Load packages

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/DataReshape.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Data ingestion

Apple mobile data was provided in this WWW page: https://www.apple.com/covid19/mobility , [APPL1]. (The data has to be download from that web page – there is an “agreement to terms”, etc.)

dsAppleMobility = ResourceFunction["ImportCSVToDataset"]["~/Downloads/applemobilitytrends-2021-01-15.csv"]

Observation: The directions requests volumes reference date for normalization is 2020-01-13 : all the values in that column are 100.

Data dimensions:

Dimensions[dsAppleMobility]

(*{4691, 375}*)

Data summary:

Magnify[ResourceFunction["RecordsSummary"][dsAppleMobility], 0.6]

Number of unique “country/region” values:

Length[Union[Normal[dsAppleMobility[Select[#["geo_type"] == "country/region" &], "region"]]]]

(*63*)

Number of unique “city” values:

Length[Union[Normal[dsAppleMobility[Select[#["geo_type"] == "city" &], "region"]]]]

(*295*)

All unique geo types:

lsGeoTypes = Union[Normal[dsAppleMobility[All, "geo_type"]]]

(*{"city", "country/region", "county", "sub-region"}*)

All unique transportation types:

lsTransportationTypes = Union[Normal[dsAppleMobility[All, "transportation_type"]]]

(*{"driving", "transit", "walking"}*)

Data transformation

It is better to have the data in long form (narrow form). For that I am using the package “DataReshape.m”, [AAp1].

(*lsIDColumnNames={"geo_type","region","transportation_type"};*) (*For the initial dataset of Apple's mobility data.*)
  lsIDColumnNames = {"geo_type", "region", "transportation_type", "alternative_name", "sub-region", "country"}; 
   dsAppleMobilityLongForm = ToLongForm[dsAppleMobility, lsIDColumnNames, Complement[Keys[dsAppleMobility[[1]]], lsIDColumnNames]]; 
   Dimensions[dsAppleMobilityLongForm]

(*{1730979, 8}*)

Remove the rows with “empty” values:

dsAppleMobilityLongForm = dsAppleMobilityLongForm[Select[#Value != "" &]];
Dimensions[dsAppleMobilityLongForm]

(*{1709416, 8}*)

Rename the column “Variable” to “Date” and add a related “DateObject” column:

AbsoluteTiming[
  dsAppleMobilityLongForm = dsAppleMobilityLongForm[All, Join[KeyDrop[#, "Variable"], <|"Date" -> #Variable, "DateObject" -> DateObject[#Variable]|>] &]; 
 ]

(*{714.062, Null}*)

Add “day name” (“day of the week”) field:

AbsoluteTiming[
  dsAppleMobilityLongForm = dsAppleMobilityLongForm[All, Join[#, <|"DayName" -> DateString[#DateObject, {"DayName"}]|>] &]; 
 ]

(*{498.026, Null}*)

Here is sample of the transformed data:

SeedRandom[3232];
RandomSample[dsAppleMobilityLongForm, 12]

Here is summary:

ResourceFunction["RecordsSummary"][dsAppleMobilityLongForm]

Partition the data into geo types × transportation types:

aQueries = Association@Flatten@Outer[Function[{gt, tt}, {gt, tt} -> dsAppleMobilityLongForm[Select[#["geo_type"] == gt && #["transportation_type"] == tt &]]], lsGeoTypes, lsTransportationTypes];

aQueries = Select[aQueries, Length[#] > 0 &];
Keys[aQueries]

(*{{"city", "driving"}, {"city", "transit"}, {"city", "walking"}, {"country/region", "driving"}, {"country/region", "transit"}, {"country/region", "walking"}, {"county", "driving"}, {"county", "transit"}, {"county", "walking"}, {"sub-region", "driving"}, {"sub-region", "transit"}, {"sub-region", "walking"}}*)

Basic data analysis

We consider relative volume o directions requests for the last date only. (The queries can easily adjusted for other dates.)

lastDate = Last@Sort@Normal@dsAppleMobilityLongForm[All, "Date"]

(*"2021-01-15"*)

aDayQueries = Association@Flatten@Outer[Function[{gt, tt}, {gt, tt} -> dsAppleMobilityLongForm[Select[#["geo_type"] == gt && #Date == lastDate && #["transportation_type"] == tt &]]], lsGeoTypes, lsTransportationTypes];

Dimensions /@ aDayQueries

(*<|{"city", "driving"} -> {299, 10}, {"city", "transit"} -> {197, 10}, {"city", "walking"} -> {294, 10}, {"country/region", "driving"} -> {63, 10}, {"country/region", "transit"} -> {27, 10}, {"country/region", "walking"} -> {63, 10}, {"county", "driving"} -> {2090, 10}, {"county", "transit"} -> {152, 10}, {"county", "walking"} -> {396, 10}, {"sub-region", "driving"} -> {557, 10}, {"sub-region", "transit"} -> {175, 10}, {"sub-region", "walking"} -> {339, 10}|>*)

Here we plot histograms and Pareto principle adherence:

opts = {PlotRange -> All, ImageSize -> Medium};
Grid[
    Function[{columnName}, 
      {Histogram[#, 12, PlotLabel -> columnName, opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[#[All, "Value"]] 
     ] /@ {"Value"}, 
    Dividers -> All, FrameStyle -> GrayLevel[0.7]] & /@ aDayQueries

Heat-map plots

We can visualize the data using heat-map plots. Here we use the package “HeatmapPlot.m”, [AAp2].

Remark: Using the contingency matrices prepared for the heat-map plots we can do further analysis, like, finding correlations or nearest neighbors. (See below.)

Cross-tabulate dates with regions:

aMatDateRegion = ResourceFunction["CrossTabulate"][#[All, {"Date", "region", "Value"}], "Sparse" -> True] & /@ aQueries;

Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to countries):

aHeatMapPlots = Association@KeyValueMap[#1 -> Rasterize[HeatmapPlot[#2, PlotLabel -> #1, DistanceFunction -> {None, EuclideanDistance}, AspectRatio -> 1/1.6, ImageSize -> 1600]] &, aMatDateRegion]

(We use Rasterize in order to reduce the size of the notebook.)

Here we take closer look to one of the plots:

aHeatMapPlots[{"country/region", "driving"}]

Nearest neighbors graphs

Graphs overview

Here we create nearest neighbor graphs of the contingency matrices computed above and plot cluster the nodes:

Manipulate[
  Multicolumn[Normal@Map[CommunityGraphPlot@Graph@EdgeList@NearestNeighborGraph[Normal[Transpose[#SparseMatrix]], nns, ImageSize -> Medium] &, aMatDateRegion], 2, Dividers -> All], 
  {{nns, 5, "Number of nearest neighbors:"}, 2, 30, 1, Appearance -> "Open"}, SaveDefinitions -> True]

Closer look into the graphs

Here we endow each nearest neighbors graph with appropriate vertex labels:

aNNGraphs = Map[(gr = NearestNeighborGraph[Normal[Transpose[#SparseMatrix]], 4, GraphLayout -> "SpringEmbedding", VertexLabels -> Thread[Rule[Normal[Transpose[#SparseMatrix]], #ColumnNames]]];Graph[EdgeList[gr], VertexLabels -> Thread[Rule[Normal[Transpose[#SparseMatrix]], #ColumnNames]]]) &, aMatDateRegion];

Here we plot the graphs with clusters:

ResourceFunction["GridTableForm"][List @@@ Normal[CommunityGraphPlot[#, ImageSize -> 800] & /@ aNNGraphs], TableHeadings -> {"region & transportation type", "communities of nearest neighbors graph"}, Background -> White, Dividers -> All]

Observation: From the community clusters of the nearest neighbor graphs (derived from the time series of the normalized driving directions requests volume) we see that countries and cities are clustered in expected ways. For example in the community graph plot corresponding to “{city, driving}” the cities Oslo, Copenhagen, Helsinki, Stockholm, and Zurich are placed in the same cluster. In the graphs corresponding to “{city, transit}” and “{city, walking}” the Japanese cities Tokyo, Osaka, Nagoya, and Fukuoka are clustered together.

Time series analysis

Time series

In this section for each date we sum all cases over the region-transportation pairs, make a time series, and plot them.

Remark: In the plots the Sundays are indicated with orange dashed lines.

Here we make the time series:

aTSDirReqByCountry = 
  Map[
   Function[{dfQuery}, 
    TimeSeries@(List @@@ Normal[GroupBy[Normal[dfQuery], #DateObject &, Total[#Value & /@ #] &]]) 
   ], 
   aQueries 
  ]

Here we plot them:

opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
Association@KeyValueMap[
   Function[{transpType, ts}, 
    transpType -> 
     DateListPlot[ts, GridLines -> {AbsoluteTime /@ Union[Normal[dsAppleMobilityLongForm[Select[#DayName == "Sunday" &], "DateObject"]]], Automatic}, GridLinesStyle -> {Directive[Orange, Dashed], Directive[Gray, Dotted]}, PlotLabel -> Capitalize[transpType], opts] 
   ], 
   aTSDirReqByCountry 
  ]

Observation: In the time series plots the Sundays are indicated with orange dashed lines. We can see that from Monday to Thursday people are more familiar with their trips than say on Fridays and Saturdays. We can also see that on Sundays people (on average) are more familiar with their trips or simply travel less.

“Forecast”

He we do “forecast” for code-workflow demonstration purposes – the forecasts should not be taken seriously.

Fit a time series model to the time series:

aTSModels = TimeSeriesModelFit /@ aTSDirReqByCountry

Plot data and forecast:

Map[DateListPlot[{#["TemporalData"], TimeSeriesForecast[#, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}] &, aTSModels]

References

[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.

[AA1] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub.

[AAp1] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

NY Times COVID-19 data visualization

Posted on March 28, 2020 by Anton Antonov Antonov

Yesterday in one of the forums I frequent it was announced that New York Times has published COVID-19 data on GitHub. I decided to make a Mathematica notebook that gives data links and related code for data ingestions. (And rudimentary data analysis.)

Here is the Markdown version of the notebook: “NY Times COVID-19 data visualization”.

Here is a screenshot of the WL notebook that also links to it:

Screenshot of an interactive interface:

Histograms and Pareto principle adherence:

Conference abstracts similarities

Posted on January 27, 2020 by Anton Antonov Antonov

Introduction

In this MathematicaVsR project we discuss and exemplify finding and analyzing similarities between texts using Latent Semantic Analysis (LSA). Both Mathematica and R codes are provided.

The LSA workflows are constructed and executed with the software monads LSAMon-WL, [AA1, AAp1], and LSAMon-R, [AAp2].

The illustrating examples are based on conference abstracts from rstudio::conf and Wolfram Technology Conference (WTC), [AAd1, AAd2]. Since the number of rstudio::conf abstracts is small and since rstudio::conf 2020 is about to start at the time of preparing this project we focus on words and texts from RStudio’s ecosystem of packages and presentations.

Statistical thesaurus for words from RStudio’s ecosystem

Consider the focus words:

{"cloud","rstudio","package","tidyverse","dplyr","analyze","python","ggplot2","markdown","sql"}

Here is a statistical thesaurus for those words:

Remark: Note that the computed thesaurus entries seem fairly “R-flavored.”

Similarity analysis diagrams

As expected the abstracts from rstudio::conf tend to cluster closely – note the square formed top-left in the plot of a similarity matrix based on extracted topics:

Here is a similarity graph based on the matrix above:

Here is a clustering (by “graph communities”) of the sub-graph highlighted in the plot above:

Notebooks

Mathematica
- ConferenceAbstractsSimilarities.md
R
- ConferenceAbstractsSimilarities.Rmd
- ConferenceAbstractsSimilarities.nb.html

Comparison observations

LSA pipelines specifications

The packages LSAMon-WL, [AAp1], and LSAMon-R, [AAp2], make the comparison easy – the codes of the specified workflows are nearly identical.

Here is the Mathematica code:

lsaObj =
  LSAMonUnit[aDesriptions]⟹
   LSAMonMakeDocumentTermMatrix[{}, Automatic]⟹
   LSAMonEchoDocumentTermMatrixStatistics⟹
   LSAMonApplyTermWeightFunctions["IDF", "TermFrequency", "Cosine"]⟹
   LSAMonExtractTopics["NumberOfTopics" -> 36, "MinNumberOfDocumentsPerTerm" -> 2, Method -> "ICA", MaxSteps -> 200]⟹
   LSAMonEchoTopicsTable["NumberOfTableColumns" -> 6];

Here is the R code:

lsaObj <- 
  LSAMonUnit(lsDescriptions) %>% 
  LSAMonMakeDocumentTermMatrix( stemWordsQ = FALSE, stopWords = stopwords::stopwords() ) %>% 
  LSAMonApplyTermWeightFunctions( "IDF", "TermFrequency", "Cosine" ) 
  LSAMonExtractTopics( numberOfTopics = 36, minNumberOfDocumentsPerTerm = 5, method = "NNMF", maxSteps = 20, profilingQ = FALSE ) %>% 
  LSAMonEchoTopicsTable( numberOfTableColumns = 6, wideFormQ = TRUE )

Graphs and graphics

Mathematica’s built-in graph functions make the exploration of the similarities much easier. (Than using R.)

Mathematica’s matrix plots provide more control and are more readily informative.

Sparse matrix objects with named rows and columns

R’s built-in sparse matrices with named rows and columns are great. LSAMon-WL utilizes a similar, specially implemented sparse matrix object, see [AA1, AAp3].

References

Articles

[AA1] Anton Antonov, A monad for Latent Semantic Analysis workflows, (2019), MathematicaForPrediction at GitHub.

[AA2] Anton Antonov, Text similarities through bags of words, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

Data

[AAd1] Anton Antonov, RStudio::conf-2019-abstracts.csv, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

[AAd2] Anton Antonov, Wolfram-Technology-Conference-2016-to-2019-abstracts.csv, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

Packages

[AAp1] Anton Antonov, Monadic Latent Semantic Analysis Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Latent Semantic Analysis Monad R package, (2019), R-packages at GitHub.

[AAp3] Anton Antonov, SSparseMatrix Mathematica package, (2018), MathematicaForPrediction at GitHub.

Pets licensing data analysis

Posted on January 20, 2020 by Anton Antonov Antonov

Introduction

This notebook / document provides ground data analysis used to make or confirm certain modeling conjectures and assumptions of a Pets Retail Dynamics Model (PRDM), [AA1]. Seattle pets licensing data is used, [SOD2].

We want to provide answers to the following questions.

Does the Pareto principle manifests for pets breeds?
Does the Pareto principle manifests for ZIP codes?
Is there an upward trend for becoming a pet owner?

All three questions have positive answers, assuming the retrieved data, [SOD2], is representative. See the last section for an additional discussion.

We also discuss pet adoption simulations that are done using Quantile Regression, [AA2, AAp1].

This notebook/document is part of the SystemsModeling at GitHub project “Pets retail dynamics”, [AA1].

Data

The pet licensing data was taken from this page: “Seattle Pet Licenses”, https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb/data.

The ZIP code coordinates data was taken from a GitHub repository,
“US Zip Codes from 2013 Government Data”, https://gist.github.com/erichurst/7882666.

Animal licenses

dsPetLicenses=ResourceFunction["ImportCSVToDataset"]["~/Datasets/Seattle/Seattle_Pet_Licenses.csv"]

image-3281001a-2f3d-4a8e-87b9-dc8a8b9803b3

Convert “Licence Issue Date” values into DateObjects.

dsPetLicenses=dsPetLicenses[All,Prepend[#,"DateObject"->DateObject[{#[[1]],{"Month","Day","Year"}}]]&];

Summary

ResourceFunction["RecordsSummary"][dsPetLicenses]

image-49aecba4-2b43-40d7-87ba-15ceb848898d

Keep dogs and cats only

Since the number of animals that are not cats or dogs is very small we remove them from the data in order to produce more concise statistics.

dsPetLicenses=dsPetLicenses[Select[MemberQ[{"Cat","Dog"},#Species]&]];
    Dimensions[dsPetLicenses]

(*{49191,8}*)

ZIP code geo-coordinates

dsZIPCodes=ImportCSVToDataset["~/Datasets/USAZipCodes/US-Zip-Codes-from-2013-Government-Data.csv"];
    Dimensions[dsZIPCodes]

(*{33144,3}*)

aZipLatLon=Association[Normal[Query[#ZIP->{#LAT,#LON}&]/@dsZIPCodes]];
    aZipLatLon=KeyMap[ToString,aZipLatLon];
    Length[aZipLatLon]

(*33144*)

Summary

ResourceFunction["RecordsSummary"][dsZIPCodes]

image-572ef441-b14e-438d-b5b7-85f244aa1857

ResourceFunction["RecordsSummary"][aZipLatLon,Thread->True]

image-c0d4f154-ee22-457f-8a36-715b77c92e08

Pareto principle adherence

In this section we apply the Pareto principle statistic in order to see does the Pareto principle manifests over the different columns of the pet licensing data.

Breeds

We see a typical Pareto principle adherence for both dog breeds and cat breeds. For example, 20% of the dog breeds correspond to 80% of all registered dogs.

Note that the number of unique cat breeds is 4 times smaller than the number of unique dog breeds.

focusColumnName="Primary Breed";

dsQuery=Query[GroupBy[#Species&],GroupBy[#[focusColumnName]&],Length]@dsPetLicenses;
    dsQuery=Dataset[ReverseSort/@Normal[dsQuery]]

image-d1bac8f8-fe6c-42c0-8d52-45ed21ab6cc2

KeyValueMap[ResourceFunction["ParetoPrinciplePlot"][Values[#2],PlotLabel->Row[{#1,Spacer[3],focusColumnName}],ImageSize->Medium,opts]&,Normal[dsQuery]]

image-3c320985-1ed4-4d11-b983-29f87d4cdc7c

Animal names

We see a typical Pareto principle adherence for the frequencies of the pet names. For dogs, 10% of the unique names correspond to ~65% of the pets.

focusColumnName="Animal's Name";

dsQuery=Query[GroupBy[#Species&],GroupBy[#[focusColumnName]&],Length]@dsPetLicenses;
    dsQuery=Dataset[ReverseSort/@Normal[dsQuery]]

image-cb6368b6-b735-4f77-a3dd-bcb0be60f28e

KeyValueMap[ResourceFunction["ParetoPrinciplePlot"][Values[#2],PlotLabel->Row[{#1,Spacer[3],focusColumnName}],ImageSize->Medium,opts]&,Normal[dsQuery]]

image-bbcac6bb-5247-400c-a093-f3002206b5cf

Zip codes

We see typical – even exaggerated – manifestation of the Pareto principle over ZIP codes of the registered pets.

focusColumnName="ZIP Code";

dsQuery=Query[GroupBy[#Species&],GroupBy[#[focusColumnName]&],Length]@dsPetLicenses;
    dsQuery=Dataset[ReverseSort/@Normal[dsQuery]];

KeyValueMap[ResourceFunction["ParetoPrinciplePlot"][Values[#2],PlotLabel->Row[{#1,Spacer[3],focusColumnName}],ImageSize->Medium,opts]&,Normal[dsQuery]]

image-72cae8dd-d342-4c90-a11d-11607545133e

Geo-distribution

In this section we visualize the pets licensing geo-distribution with geo-histograms.

city=Entity["City",{"Seattle","Washington","UnitedStates"}];
    GeoPosition[city]

(*GeoPosition[{47.6205,-122.351}]*)

Both cats and dogs

lsCoords=Map[If[KeyExistsQ[aZipLatLon,#],aZipLatLon[#],Nothing]&,Select[ToString/@Normal[dsPetLicenses[All,"ZIP Code"]],StringQ[#]&&StringLength[#]>=5&]];

GeoHistogram[lsCoords,GeoCenter->city,GeoRange->Quantity[20,"Miles"],PlotLegends->Automatic,ColorFunction->(Hue[2/3,2/3,1-#]&),opts]

image-94ae1316-ada2-4195-b2fc-6864ff1fd642

Dogs and cats separately

lsCoords=Map[If[KeyExistsQ[aZipLatLon,#],aZipLatLon[#],Nothing]&,Select[ToString/@Normal[dsPetLicenses[Select[#Species=="Dog"&],"ZIP Code"]],StringQ[#]&&StringLength[#]>=5&]];
    gr1=GeoHistogram[lsCoords,GeoCenter->city,GeoRange->Quantity[20,"Miles"],PlotLegends->Automatic,ColorFunction->(Hue[2/3,2/3,1-#]&),opts];

lsCoords=Map[If[KeyExistsQ[aZipLatLon,#],aZipLatLon[#],Nothing]&,Select[ToString/@Normal[dsPetLicenses[Select[#Species=="Cat"&],"ZIP Code"]],StringQ[#]&&StringLength[#]>=5&]];
    gr2=GeoHistogram[lsCoords,GeoCenter->city,GeoRange->Quantity[20,"Miles"],PlotLegends->Automatic,ColorFunction->(Hue[2/3,2/3,1-#]&),opts];

ResourceFunction["GridTableForm"][{gr1,gr2},TableHeadings->{"Dogs","Cats"},Background->White]

image-836dff19-7000-45e0-b0a4-1f3fe4a066c9

Pet stores

In this subsection we show the distribution of pet stores (in Seattle).

It is better instead of image retrieval to show corresponding geo-markers in the geo-histograms above. (This is not considered that important in the first version of this notebook/document.)

WebImage["https://www.google.com/maps/search/pet+stores+in+Seattle,+WA/@47.6326975,-122.4227211,12.05z"]

Time series

In this section we visualize the time series corresponding to the pet registrations.

Time series objects

Here we make time series objects:

dsQuery=Query[GroupBy[#Species&],GroupBy[#DateObject&],Length]@dsPetLicenses;
    aTS=TimeSeries/@(List@@@Normal[#]&/@Normal[dsQuery])

image-49ae54cb-0644-427e-a015-0392284aaaa7

Time series plots of all registrations

Here are time series plots corresponding to all registrations:

DateListPlot[#,opts]&/@aTS

image-02632be6-ab52-41b8-959a-e200641fdd8f

Time series plots of most recent registrations

It is an interesting question why the number of registrations is much higher in volume and frequency in the years 2018 and later.

DateListPlot[TimeSeriesWindow[#,{{2017,1,1},{2020,1,1}}],opts]&/@aTS

image-85ebeab1-cad5-4fe3-bd5d-c7c8c94a753e

Upward trend

Here we apply both Linear Regression and Quantile Regression:

QRMonUnit[TimeSeriesWindow[#,{{2018,1,1},{2020,1,1}}]]⟹
    QRMonLeastSquaresFit[{1,x}]⟹
    QRMonQuantileRegressionFit[4,0.5]⟹
    QRMonDateListPlot[opts,"Echo"->False]⟹
    QRMonTakeValue&/@aTS

image-6df4d9d2-e48a-4d63-885c-6ed5112c0f15

We can see that there is clear upward trend for both dogs and cats.

Quantile regression application

In this section we investigate the possibility to simulate the pet adoption rate. We plan to use simulations of the pet adoption rate in PRDM.

We do that using the software monad QRMon, [AAp1]. A list of steps follows.

Split the time series into windows corresponding to the years 2018 and 2019.
Find the difference between the two years.
Apply Quantile Regression to the difference using a reasonable grid of probabilities.
Simulate the difference.
Add the simulated difference to year 2019.

Simulation

In this sub-section we simulate the differences between the time series for 2018 and 2019, then we add the simulated difference to the time series of the year 2019.

ts1=TimeSeriesResample[TimeSeriesWindow[aTS[[1]],{{2018,1,1},{2019,1,1}}],"Day"];
    ts1["Path"][[All,2]];

ts2=TimeSeriesResample[TimeSeriesWindow[aTS[[1]],{{2019,1,1},{2020,1,1}}],"Day"];
    ts2["Path"][[All,2]];

ts3=TimeSeries[Transpose[{ts1["Path"][[All,1]],ts2["Path"][[All,2]]-ts1["Path"][[All,2]]}]]

image-8f9e3af0-46b7-4417-bd1e-3201c1134f34

qrObj=
    QRMonUnit[ts3]⟹
    QRMonEchoDataSummary⟹
    QRMonQuantileRegression[20,Join[Range[0.1,0.9,0.1],{0.03,0.93}],InterpolationOrder->2]⟹
    QRMonDateListPlot[opts];

image-30b836dc-f166-4f21-9c0b-9cca922058e6

image-65e4d1bf-dfff-4073-88a0-63177eeed1b5

qrObj=
    qrObj⟹
    QRMonEchoFunctionContext[DateListPlot[#data,PlotLabel->"Original data",opts]&]⟹
    QRMonSimulate[ts2["Path"]//Length]⟹
    QRMonEchoFunctionValue[DateListPlot[#,PlotLabel->"Simulated data",opts]&];

image-6d107cad-6fef-46c8-92a8-59ea78b5039f

image-d0d517e0-925b-486c-88fd-287cfe02e799

Take the simulated time series difference:

tsSimDiff=TimeSeries[qrObj⟹QRMonTakeValue];

Add the simulated time series difference to year 2019, clip the values less than zero, shift the result to 2020:

tsSim=MapThread[{#1[[1]],#1[[2]]+#2[[2]]}&,{ts2["Path"],tsSimDiff["Path"]}];
    tsSim[[All,2]]=Clip[tsSim[[All,2]],{0,Max[tsSim[[All,2]]]}];
    tsSim=TimeSeriesShift[TimeSeries[tsSim],Quantity[365,"Days"]];
    DateListPlot[tsSim,opts]

image-2a29feca-73b8-4fce-8051-145d74ec499c

Plot all years together

DateListPlot[{ts1,ts2,tsSim},opts,PlotLegends->{2018,2019,2020}]

image-793f146a-07f9-455f-9bc7-2ef7d7897691

Discussion

This section has subsections that correspond to additional discussion questions. Not all questions are answered, the plan is to progressively answer the questions with the subsequent versions of the this notebook / document.

□ Too few pets

The number of registered pets seems too few. Seattle is a large city with more than 600000 citizens; approximately 50% of the USA households have dogs; hence the registered pets are too few (~50000).

□ Why too few pets?

Seattle is a high tech city and its citizens are too busy to have pets?

Most people do not register their pets? (Very unlikely if they have used veterinary services.)

Incomplete data?

□ Registration rates

Why the number of registrations is much higher in volume and frequency in the years 2018 and later?

□ Adoption rates

Can we tell apart the adoption rates of pet-less people and people who already have pets?

Preliminary definitions

opts=Sequence@@{PlotRange->All,ImageSize->Medium,PlotTheme->"Detailed"};

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

References

[AA1] Anton Antonov, Pets retail dynamics project, (2020), SystemModeling at GitHub.

[AA2] Anton Antonov, A monad for Quantile Regression workflows, (2018), MathematicaForPrediction at WordPress.

[AAp1] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.

[SOD1] Seattle Open Data, “Seattle Pet Licenses”, https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb/data .

Wolfram Live-Coding Series on Quantile Regression workflows

Posted on September 18, 2019 by Anton Antonov Antonov

A month or so ago I was invited to make Quantile Regression presentations at the Wolfram Research Twitch channel.

The live-coding sessions

In the first
live-streaming / live-coding session
I demonstrated how to make
Quantile Regression
workflows using the software monad QRMon
and some of the underlying software design principles. (Namely
“monadic programming”.)
In the follow up live-coding session I discussed topics like outliers removal (data cleaning), anomaly detection, and structural breaks.
In the third live-coding session:
- First, we demonstrate and explain how to do QR-based time series simulations and their applications in Operations Research.
- Next, we discuss QR in 2D and 3D and a related application.
In the fourth live-coding session we discussed the following the topics.
- Brief review of previous sessions.
- Proclaiming the upcoming ResourceFunction["QuantileRegression"].
- Predict tomorrow from today’s data.
- Using NLP techniques on time series.
- Generation of QR workflows with natural language commands.

Notebooks

The notebooks are available at https://community.wolfram.com/groups/-/m/t/1787896 .

I did not save the notebook I made during the first live coding session, but I attached to that WRI community post a modified version of the notebook I used for a Meetup workshop 6-7 months ago. (See for more details
the GitHub MathematicaVsR project “Quantile Regression Workflows”.)

The notebook of the 2nd session is also attached. (I added a “References” section to it.)

Update (2019-11-13)

A few days ago the Wolfram Function Repository entry QuantileRegression was approved. The resource description page has many of the topics discussed in the live-coding sessions on Quantile regression.

	数字为2026 年不卓越但很快乐 - 偏… on Numerically 2026 is unremarkab…
	Day 22 – Numer… on Primitive roots generation…
	Numerically 2026 is… on Primitive roots generation…
	Day 9 – Monadi… on Monad code generation and…
	Monad laws in Raku \|… on Monad code generation and…

Introduction

Data ingestion

Grammar and parsers

Plotting the clock

More robust parsing

Fuzzy matching

Parsing of numeric word forms

Enhance with LLM parsing

Robust parser demo

Timeline

Parsing page data

Using LLM-extraction instead

Timeline plot

Conclusion

Setup

References

Articles, notebooks

Paclets

Videos

Introduction

References

Data ingestion

Distributions

Pareto principle manifestation

Number of creations

Popularity

Correlations

Bubble chart 2D

Bubble chart 3D

Second system effect traces

Instead of conclusions

Introduction

Cryptocurrencies data

Significant cryptocurrencies

By summaries

By price

By trading volume

Related events

Cryptocurrency time series with events

Observations

Going down

Going up

Time series correlations

Getting time series data

Opening price time series

Trading volume time series

Observations

Nearest neighbors graphs

Graphs overview

Other types of analysis

Load packages

Definitions

References

Articles

Packages

Resource functions

Introduction

The overall process

Data

The Apple data

The New York Times data

The search engines

Structure

Implementation

Examples

Apple data search engine examples

Nice, France vs Florida, USA

Italy and Balkan countries driving

The New York Times data search engine examples

References

Data

Articles

Packages, repositories

Introduction

Import data

NYTimes USA states data

NYTimes USA counties data

US county records

Merge data

Basic data analysis