Stefan Krawczyk

San Francisco, California, United States
7K followers 500+ connections

About

With over 10 years of experience in building and leading data & ML related systems and…

Articles by Stefan

  • February Updates

    TL;DR: #Hamilton highlights: crossed 2000 GitHub stars, released multithreading-based DAG parallelism, RichProgressBar…

  • Last week of 2024 / first week of 2025

    TL;DR: #Hamilton + #Burr 2024 stats: 35M+ telemetry events (10x), 100K+ unique IPs (10x) from 1000+ companies, 1M+…

  • Week of December 9th

    TL;DR: #Hamilton release highlights: better TypedDict support and modular subdag example. Office Hours & Meetups for…

  • Week of December 2nd

    TL;DR: #Hamilton release highlights: Async Datadog Integration, Polars & Pandas with_columns support. #Burr release…

  • Week of November 18th

    TL;DR: #Hamilton release highlights: SDK configurability. #Burr release highlights: parallelism UI modifications, video…

  • Week of November 11th

    TL;DR: #Hamilton release highlights: async support for @pipe + various small fixes. #Burr release highlights:…

  • Week of November 4th

    TL;DR: #Hamilton release highlights: @with_columns decorator for Pandas by Jernej Frank & module overrides for async…

  • Week of October 28th

    TL;DR: #Hamilton release highlights: in-memory cache store. #Burr release highlights: release candidate for a first…

  • Week of October 21st

    TL;DR: #Hamilton release highlights: some minor fixes and docs updates from five different OS contributors! Also…

  • Week of October 14th

    TL;DR: Announcing Shreya Shankar as an advisor. #Hamilton release highlights: tweaks to pipe_input, new…

Experience & Education

  • Salesforce

Publications

  • Hamilton: a modular open source declarative paradigm for high level modeling of dataflows

    1st International Workshop on Composable Data Management Systems, CDMS@VLDB 2022, Sydney, Australia, September 9, 2022

    https://cdmsworkshop.github.io/2022/Proceedings/ShortPapers/Paper6_StefanKrawczyk.pdf

  • Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs

    1st International Workshop on Data Ecosystems, co-located with the 48th International Conference on Very Large Data Bases (VLDB 2022)

    https://ceur-ws.org/Vol-3306/paper5.pdf

  • Citation-based bootstrapping for large-scale author disambiguation

    Journal of the American Society for Information Science and Technology

    Work that I did with the NLP group.
    Abstract: We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B3 F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.

  • Probabilistic Ontology Trees for Belief Tracking in Dialog Systems

    Proceedings of SIGDIAL 2010: the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue

    Wrote the manual system they used here.

  • Investigating SMS Text Normalization using Statistical Machine Translation

    CS224N Stanford University, Stanford, CA, 2009.

    Class project on using statistical machine translation to convert SMS messages in 'text speak' back into normal English. We've had many requests for our data set and have had our project cited a couple times!

  • Grid resource allocation: allocation mechanisms and utilisation patterns

    Proceedings of the sixth Australasian workshop on Grid computing and e-research - Volume 82

    Conference paper based off my honours thesis.


Patents

  • Belief tracking and action selection in spoken dialog systems

    US 8676583

    An action is performed in a spoken dialog system in response to a user's spoken utterance. A policy which maps belief states of user intent to actions is retrieved or created. A belief state is determined based on the spoken utterance, and an action is selected based on the determined belief state and the policy. The action is performed, and in one embodiment, involves requesting clarification of the spoken utterance from the user. Creating a policy may involve simulating user inputs and spoken dialog system interactions, and modifying policy parameters iteratively until a policy threshold is satisfied. In one embodiment, a belief state is determined by converting the spoken utterance into text, assigning the text to one or more dialog slots associated with nodes in a probabilistic ontology tree (POT), and determining a joint probability based on probability distribution tables in the POT and on the dialog slot assignments.

  • TEAM MEMBER RECOMMENDATION SYSTEM

    US

Courses

  • Machine Learning

    CS229

  • Natural Language Processing

    CS224N

  • Natural Language Understanding

    CS224U

  • Speech Recognition and Synthesis

    CS224S

Projects

  • Algorithms Tech Branding

    - Present

    A self-organized group managing https://multithreaded.stitchfix.com/algorithms/blog/ and branding for the Algorithms organization.

  • Nextdoor Feature Config

    Designed and implemented a library for rolling out new features at Nextdoor.

    This used our open source ZooKeeper library (https://github.com/Nextdoor/ndserviceregistry) to store and access feature configurations.

  • R3

    In the spirit of grassroots innovation, I initiated and led an effort to give everyone at Nextdoor the time to work on anything they wanted. Giving people the time to do anything can be scary, so to align and orient everyone, the team came up with steps that people could refer to: "Reflect, Reinvent & Refine", hence the name R3.

  • Structured Application Logging

    1) Wrote a Python log handler that converted application logs to structured JSON objects for easier ingestion and consumption downstream.
    2) Implemented ingestion and consumption of structured logs using Apache Flume, linking to Elasticsearch and S3.

  • Nextdoor.com Holiday Lights Map

    During the holiday season, Nextdoor members are able to add themselves to a map of their neighborhood indicating the homes with holiday lights and upload festive photos and messages.

  • LinkedIn Idea Bank

    - Present

    The LinkedIn Idea Bank is a way for LinkedIn employees to reach the entire LinkedIn organization with their ideas, find like-minded people, iterate on their ideas, and let people track their progress.

    It sports a Quora-meets-Pinterest interface and runs on Play 2.0 with MongoDB on the backend.

  • [in]cubator

    - Present

    Worked on a program that gives employees up to 90 days to spend polishing their hacks.

    Specifically, I helped form the committee and its philosophy, and drove building the internal website that now drives grassroots innovation at LinkedIn.

  • Trunkstats

    For hackday I wrote a tool to keep metrics on the status of trunk at LinkedIn. The tool was a simple website that used the Play framework and Google Charts to show trunk health and check-in trends. The VP of engineering used it weekly for engineering status reports. It fell out of use 6 months later, when the tools team caught up and wrote their own, more integrated tool.

  • [in]sightful

    For hackday I mashed together the feedback that comes into the site with the user's profile data and stuck that into Lucene. I used LinkedIn's add-ons Bobo and Zoie to add real-time indexing and faceted search. So rather than doing text search in an email client, people could do a full-text search with Lucene and also slice and dice by profile data.

  • Stanford Masters Admissions Committee 2010

    Was one of the student representatives on the Stanford CS Master's Admissions Committee.

  • SMS Text Normalization

    A system for converting textspeak (language used in SMS communication) to proper English using statistical machine translation. Presented in the PhD Poster Session of Stanford Computer Forum's annual affiliates meeting in April 2010. Poster available at http://forum.stanford.edu/events/posterslides/SMSTextNormalizationusingStatisticalMachineTranslation.pdf.


Languages

  • English

    Native or bilingual proficiency

  • Polish

    Professional working proficiency

  • Japanese

    Elementary proficiency

  • Spanish

    Elementary proficiency

Organizations

  • VUW Handball Club

    President, Webmaster, VUWSA Club Sports Council Representative

    Was a founding member and helped run the club. Was president for one year.

  • VUWSA Blues Panel

    Elected Member

    This position involved reviewing applications, debating and determining VUW sports awards for student athletes.

  • VUWSA Sports Council

    Elected Member

    Dealt with sports related topics concerning the VUWSA.
