<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="https://kakaoenterprise.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kakaoenterprise.github.io/" rel="alternate" type="text/html" /><updated>2023-05-31T03:04:47-05:00</updated><id>https://kakaoenterprise.github.io/feed.xml</id><title type="html">카카오엔터프라이즈 AI Research</title><subtitle>Introducing AI papers and research results published by Kakao Enterprise AI Lab.</subtitle><author><name>카카오엔터프라이즈</name></author><entry><title type="html">FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs</title><link href="https://kakaoenterprise.github.io/papers/interspeech-fastfit" rel="alternate" type="text/html" title="FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs" /><published>2023-08-20T00:00:00-05:00</published><updated>2023-08-20T00:00:00-05:00</updated><id>https://kakaoenterprise.github.io/papers/interspeech-fastfit</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/interspeech-fastfit"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>This paper presents FastFit, a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs) to achieve faster generation rates without sacrificing sample quality. We replaced each encoder block with an STFT, with parameters equal to the temporal resolution of each decoder block, leading to the skip connection. FastFit reduces the number of parameters and the generation time of the model by almost half while maintaining high fidelity. Through objective and subjective evaluations, we demonstrated that the proposed model achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality. We further showed that FastFit produces sound qualities similar to those of other baselines in text-to-speech evaluation scenarios, including multi-speaker and zero-shot text-to-speech.</p>]]></content><author><name>taylor:카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning</title><link href="https://kakaoenterprise.github.io/papers/icml-rl" rel="alternate" type="text/html" title="On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning" /><published>2023-07-23T00:00:00-05:00</published><updated>2023-07-23T00:00:00-05:00</updated><id>https://kakaoenterprise.github.io/papers/ICML-rl</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/icml-rl"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>Recently, unsupervised representation learning (URL) has improved the sample efficiency of Reinforcement Learning (RL) by pretraining a model from a large unlabeled dataset. The underlying principle of these methods is to learn temporally predictive representations by predicting future states in the latent space. However, an important challenge of this approach is the representational collapse, where the subspace of the latent representations collapses into a low-dimensional manifold. To address this issue, we propose a novel URL framework that causally predicts future states while increasing the dimension of the latent manifold by decorrelating the features in the latent space. Through extensive empirical studies, we demonstrate that our framework effectively learns predictive representations without collapse, which significantly improves the sample efficiency of state-of-the-art URL methods on the Atari 100k benchmark. The code is available at https://github.com/dojeon-ai/SimTPR.</p>]]></content><author><name>이호준:카이스트, 카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Local 3D Editing via 3D Distillation of CLIP Knowledge</title><link href="https://kakaoenterprise.github.io/papers/cvpr-lenerf" rel="alternate" type="text/html" title="Local 3D Editing via 3D Distillation of CLIP Knowledge" /><published>2023-06-20T00:00:00-05:00</published><updated>2023-06-20T00:00:00-05:00</updated><id>https://kakaoenterprise.github.io/papers/CVPR-LENeRF</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/cvpr-lenerf"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D Avatar editing). Recently proposed 3D GANs can generate diverse photorealistic 3D-aware contents using Neural Radiance fields (NeRF). However, manipulation of NeRF still remains a challenging problem since the visual quality tends to degrade after manipulation and suboptimal control handles such as 2D semantic maps are used for manipulations. While text-guided manipulations have shown potential in 3D editing, such approaches often lack locality. To overcome these problems, we propose Local Editing NeRF (LENeRF), which only requires text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used for local manipulations of 3D features by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way, by distilling the zero-shot mask generation capability of CLIP to the 3D space with multi-view guidance. We conduct diverse experiments and thorough evaluations both quantitatively and qualitatively.</p>]]></content><author><name>형준하:카이스트 AI, 카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Revisiting the Importance of Amplifying Bias for Debiasing</title><link href="https://kakaoenterprise.github.io/papers/aaai-debiasing" rel="alternate" type="text/html" title="Revisiting the Importance of Amplifying Bias for Debiasing" /><published>2023-02-07T00:00:00-06:00</published><updated>2023-02-07T00:00:00-06:00</updated><id>https://kakaoenterprise.github.io/papers/AAAI-debiasing</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/aaai-debiasing"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>In image classification, debiasing aims to train a classifier to be less susceptible to dataset bias, the strong correlation between peripheral attributes of data samples and a target class. For example, even if the frog class in the dataset mainly consists of frog images with a swamp background (i.e., bias-aligned samples), a debiased classifier should be able to correctly classify a frog at a beach (i.e., bias-conflicting samples). Recent debiasing approaches commonly use two components for debiasing, a biased model f<sub>B</sub> and a debiased model f<sub>D</sub>. f<sub>B</sub> is trained to focus on bias-aligned samples (i.e., overfitted to the bias) while f<sub>D</sub> is mainly trained with bias-conflicting samples by concentrating on samples which f<sub>B</sub> fails to learn, leading f<sub>D</sub> to be less susceptible to the dataset bias. While the state-of-the-art debiasing techniques have aimed to better train f<sub>D</sub>, we focus on training f<sub>B</sub>, an overlooked component until now. Our empirical analysis reveals that removing the bias-conflicting samples from the training set for f<sub>B</sub> is important for improving the debiasing performance of f<sub>D</sub>. This is due to the fact that the bias-conflicting samples work as noisy samples for amplifying the bias for f<sub>B</sub> since those samples do not include the bias attribute. To this end, we propose a simple yet effective data sample selection method which removes the bias-conflicting samples to construct a bias-amplified dataset for training f<sub>B</sub>. Our data sample selection method can be directly applied to existing reweighting-based debiasing approaches, obtaining consistent performance boost and achieving the state-of-the-art performance on both synthetic and real-world datasets.</p>]]></content><author><name>이정수:카이스트, 카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Efficient Skeleton-Based Action Recognition via Joint-Mapping strategies</title><link href="https://kakaoenterprise.github.io/papers/wacv-action-recognition" rel="alternate" type="text/html" title="Efficient Skeleton-Based Action Recognition via Joint-Mapping strategies" /><published>2023-01-03T00:00:00-06:00</published><updated>2023-01-03T00:00:00-06:00</updated><id>https://kakaoenterprise.github.io/papers/wacv-action-recognition</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/wacv-action-recognition"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>Graph convolutional networks (GCNs) have brought remarkable progress in skeleton-based action recognition. However, high computational cost and large model size make these models difficult to apply in real-world embedded systems. Specifically, a GCN deployed in an automated surveillance system requires upstream models such as pedestrian detection and human pose estimation. Therefore, each model should be computationally lightweight, and the whole pipeline should operate in real time. In this paper, we propose two different joint-mapping modules that reduce the number of joint representations, lowering the total computational cost and model size. Our models achieve a better accuracy-latency trade-off than the previous state-of-the-art methods on two datasets, namely NTU RGB+D and NTU RGB+D 120, demonstrating their suitability for practical applications. Furthermore, we measure the latency of the models using the TensorRT framework to compare them from a practical perspective.</p>]]></content><author><name>marcus:카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Normalizing Mutual Information for Robust Adaptive Training for Translation</title><link href="https://kakaoenterprise.github.io/papers/emnlp-translation" rel="alternate" type="text/html" title="Normalizing Mutual Information for Robust Adaptive Training for Translation" /><published>2022-12-07T00:00:00-06:00</published><updated>2022-12-07T00:00:00-06:00</updated><id>https://kakaoenterprise.github.io/papers/emnlp-translation</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/emnlp-translation"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>Despite the success of neural machine translation models, the tension between the fluency of optimizing target language modeling and source-faithfulness remains a challenge. Previously, Conditional Bilingual Mutual Information (CBMI), a scoring metric for the importance of target sentences and tokens, was proposed to encourage fluent and faithful translations. The score is obtained by combining the probability from the translation model and the target language model, which is then used to assign different weights to losses from sentences and tokens. Meanwhile, we argue this metric is not properly normalized, for which we propose Normalized Pointwise Mutual Information (NPMI). NPMI utilizes an additional language model on the source language to approximate the joint likelihood of the source-target pair and the likelihood of the source, which is then used for normalizing the score. We show that NPMI better captures the dependence between source and target, and that NPMI-based token-level adaptive training brings improvements over baselines, with empirical results from En-De, De-En, and En-Ro translation tasks.</p>]]></content><author><name>이영원:서울대</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">LittleBird: Efficient Faster &amp;amp; Longer Transformer for Question Answering</title><link href="https://kakaoenterprise.github.io/papers/emnlp-littlebird" rel="alternate" type="text/html" title="LittleBird: Efficient Faster &amp;amp; Longer Transformer for Question Answering" /><published>2022-12-07T00:00:00-06:00</published><updated>2022-12-07T00:00:00-06:00</updated><id>https://kakaoenterprise.github.io/papers/emnlp-littlebird</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/emnlp-littlebird"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>BERT has shown a lot of success in a wide variety of NLP tasks, but its attention mechanism limits its ability to deal with long inputs. Longformer, ETC and BigBird addressed this issue and effectively solved the quadratic dependency problem. However, we find that these models are not sufficient, and propose LittleBird, a novel model based on BigBird with improved speed and memory footprint while maintaining accuracy. In particular, we devise a more flexible and efficient position representation method based on Attention with Linear Biases (ALiBi). We also show that replacing BigBird's method of representing global information with pack and unpack attention is more effective. The proposed model can work on long inputs even after being pre-trained on short inputs, and can be trained efficiently by reusing an existing pretrained language model for short inputs. This is a significant benefit for low-resource languages, where large amounts of long text data are difficult to obtain. As a result, our experiments show that LittleBird works very well in a variety of languages, achieving high performance on question answering tasks, particularly on KorQuAD2.0, a Korean question answering dataset for long paragraphs.</p>]]></content><author><name>phil:카카오엔터프라이즈</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets</title><link href="https://kakaoenterprise.github.io/papers/emnlp-apeach" rel="alternate" type="text/html" title="APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets" /><published>2022-12-07T00:00:00-06:00</published><updated>2022-12-07T00:00:00-06:00</updated><id>https://kakaoenterprise.github.io/papers/emnlp-apeach</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/emnlp-apeach"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>In hate speech detection, developing training and evaluation datasets across various domains is a critical issue. However, most approaches crawl social media texts and hire crowd-workers to annotate the data. Following this convention often restricts the scope of pejorative expressions to a single domain, lacking generalization. Moreover, domain overlap between the training corpus and the evaluation set can overestimate prediction performance when pretraining language models on a low-data language. To alleviate these problems in Korean, we propose APEACH, which asks unspecified users to generate hate speech examples, followed by minimal post-labeling. We find that APEACH can collect useful datasets that are less sensitive to the lexical overlaps between the pretraining corpus and the evaluation set, thereby properly measuring the model performance.</p>]]></content><author><name>양기창:카카오, 카카오엔터프라이즈, 숭실대</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Persona-Knowledge Dialogue Multi-Context Retrieval and Enhanced Decoding Methods</title><link href="https://kakaoenterprise.github.io/papers/coling-multi-context-retrieval" rel="alternate" type="text/html" title="Persona-Knowledge Dialogue Multi-Context Retrieval and Enhanced Decoding Methods" /><published>2022-10-16T00:00:00-05:00</published><updated>2022-10-16T00:00:00-05:00</updated><id>https://kakaoenterprise.github.io/papers/Coling-multi-context-retrieval</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/coling-multi-context-retrieval"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>Persona and Knowledge dual context open-domain chat is a novel dialogue generation task introduced recently (Jang et al., 2021). While Persona and Knowledge are each interesting contexts of open-domain dialogue, the combination of the two has not been well studied. We tackle the Persona-Knowledge identification and response generation tasks in this paper. We design an informed data augmentation strategy that is compatible with neural Q&amp;A retrieval models. With the augmented data, we perform permutative Persona-Knowledge evaluation and successive Persona search fine-tuning. Furthermore, we perform dialogue generation with various decoding techniques and illustrate crucial elements. We achieve SOTA across official metrics, with an average Grounding accuracy of 93.99% and a SacreBLEU score of 23.62.</p>]]></content><author><name>오민식:Alexa AI</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry><entry><title type="html">Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers</title><link href="https://kakaoenterprise.github.io/papers/interspeech-rnn-t" rel="alternate" type="text/html" title="Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers" /><published>2022-09-18T00:00:00-05:00</published><updated>2022-09-18T00:00:00-05:00</updated><id>https://kakaoenterprise.github.io/papers/interspeech-rnn-t</id><content type="html" xml:base="https://kakaoenterprise.github.io/papers/interspeech-rnn-t"><![CDATA[<h1 id="abstract">Abstract</h1>

<p>Recurrent neural network transducer (RNN-T) is an end-to-end speech recognition framework that converts input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model local-global context information via its convolution and self-attention layers. Although Conformer RNN-T has shown outstanding performance, most studies have verified it in settings where the training and test data are drawn from the same domain. The domain mismatch problem for Conformer RNN-T has not been intensively investigated yet, which is an important issue for product-level speech recognition systems. In this study, we identified that the fully connected self-attention layers in the Conformer cause high deletion errors, specifically on long-form out-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which can exploit local and generalized global information by pruning most of the in-domain-fitted global connections. We also propose a state reset method for the generalization of the prediction network to cope with long-form utterances. Applying the proposed methods to an out-domain test, we obtained a 27.6% relative character error rate (CER) reduction compared to fully connected self-attention layer-based Conformers.</p>]]></content><author><name>김준태:SK텔레콤</name></author><category term="papers" /><summary type="html"><![CDATA[Abstract]]></summary></entry></feed>