Publications
-
Hamilton: a modular open source declarative paradigm for high level modeling of dataflows
1st International Workshop on Composable Data Management Systems, CDMS@VLDB 2022, Sydney, Australia, September 9, 2022
https://cdmsworkshop.github.io/2022/Proceedings/ShortPapers/Paper6_StefanKrawczyk.pdf
-
Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs
1st International Workshop on Data Ecosystems co-located with 48th International Conference on Very Large Databases (VLDB 2022)
-
Citation-based bootstrapping for large-scale author disambiguation
Journal of the American Society for Information Science and Technology
Work that I did with the NLP group.
Abstract: We present a new, two-stage, self-supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high-precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature-based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self-supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B3 F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self-citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.
-
Probabilistic Ontology Trees for Belief Tracking in Dialog Systems
Proceedings of SIGDIAL 2010: the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue
-
Investigating SMS Text Normalization using Statistical Machine Translation
CS224N Stanford University, Stanford, CA, 2009.
Class project on using statistical machine translation to convert SMS messages in 'text speak' back into normal English. We've had many requests for our data set and have had our project cited a couple of times!
-
Grid resource allocation: allocation mechanisms and utilisation patterns
Proceedings of the sixth Australasian workshop on Grid computing and e-research - Volume 82
Patents
-
Belief tracking and action selection in spoken dialog systems
US 8676583
An action is performed in a spoken dialog system in response to a user's spoken utterance. A policy which maps belief states of user intent to actions is retrieved or created. A belief state is determined based on the spoken utterance, and an action is selected based on the determined belief state and the policy. The action is performed, and in one embodiment, involves requesting clarification of the spoken utterance from the user. Creating a policy may involve simulating user inputs and spoken dialog system interactions, and modifying policy parameters iteratively until a policy threshold is satisfied. In one embodiment, a belief state is determined by converting the spoken utterance into text, assigning the text to one or more dialog slots associated with nodes in a probabilistic ontology tree (POT), and determining a joint probability based on probability distribution tables in the POT and on the dialog slot assignments.
-
TEAM MEMBER RECOMMENDATION SYSTEM
US
Courses
-
Machine Learning
CS229
-
Natural Language Processing
CS224N
-
Natural Language Understanding
CS224U
-
Speech Recognition and Synthesis
CS224S
Projects
-
Algorithms Tech Branding
- Present
A self organized group managing https://multithreaded.stitchfix.com/algorithms/blog/ and branding for the Algorithms organization.
-
Nextdoor Feature Config
Designed and implemented a library for rolling out new features at Nextdoor.
This used our open source ZooKeeper library (https://github.com/Nextdoor/ndserviceregistry) to store and access feature configurations.
-
R3
In the spirit of grassroots innovation, I initiated and led an effort to give everyone at Nextdoor the time to work on anything they wanted. Giving people the time to do anything can be scary, so to align and orient everyone, the team came up with some steps that people could refer to: "Reflect, Reinvent & Refine", hence the name R3.
-
Structured Application Logging
1) Wrote a Python log handler that converted application logs to structured JSON objects for easier ingestion and consumption downstream.
2) Implemented ingestion and consumption of structured logs using Apache Flume, linking to Elasticsearch and S3.
-
Nextdoor.com Holiday Lights Map
During the holiday season, Nextdoor members are able to add themselves to a map of their neighborhood indicating the homes with holiday lights and upload festive photos and messages.
-
LinkedIn Idea Bank
- Present
The LinkedIn Idea Bank is a way for LinkedIn employees to reach the entire LinkedIn organization with their ideas, find like-minded people, iterate on their ideas, and let others track their progress.
It sports a Quora-meets-Pinterest-style interface and runs on Play 2.0 with MongoDB on the backend.
-
[in]cubator
- Present
Worked on a program that gives employees up to 90 days' worth of time to polish their hacks.
Specifically, I helped form the committee and philosophy, and drove building the internal website that now drives grassroots innovation at LinkedIn.
-
Trunkstats
-
For hackday, I wrote a tool to track metrics on the status of trunk at LinkedIn. The tool was a simple website using the Play framework and Google Charts to show trunk health and check-in trends. The VP of engineering used it weekly for engineering status reports. It fell out of use six months later, when the tools team finally caught up and wrote their own, more integrated tool.
-
[in]sightful
-
For hackday, I mashed together the feedback that comes into the site with the user's profile data and indexed it in Lucene. I used LinkedIn's add-ons Bobo and Zoie to add real-time indexing and faceted search. So rather than doing text search in an email client, people could do a full-text search with Lucene and also slice and dice by profile data.
-
Stanford Masters Admissions Committee 2010
-
Was one of the student representatives on the Stanford CS Master's Admissions Committee.
-
SMS Text Normalization
-
A system for converting textspeak (language used in SMS communication) to proper English using statistical machine translation. Presented in the PhD Poster Session of Stanford Computer Forum's annual affiliates meeting in April 2010. Poster available at http://forum.stanford.edu/events/posterslides/SMSTextNormalizationusingStatisticalMachineTranslation.pdf.
Languages
-
English
Native or bilingual proficiency
-
Polish
Professional working proficiency
-
Japanese
Elementary proficiency
-
Spanish
Elementary proficiency
Organizations
-
VUW Handball Club
President, Webmaster, VUWSA Club Sports Council Representative
Was a founding member and helped run the club. Was president for one year.
-
VUWSA Blues Panel
Elected Member
This position involved reviewing applications, and debating and determining VUW sports awards for student athletes.
-
VUWSA Sports Council
Elected Member
Dealt with sports-related topics concerning the VUWSA.